
Cannot resume training without quality loss #30
Closed
AndroYD84 opened this issue Jan 6, 2020 · 5 comments
@AndroYD84

Due to unfortunate circumstances my training process terminated abruptly. Every attempt to resume it has produced a model that is much worse than before the interruption, and even after days of training it doesn't get back to its previous quality, as if the progress were unrecoverable. I tried two different methods:

  1. I edit this line in train.py with the latest checkpoint number (i.e. "123456" if the checkpoint filename is "checkpoint_123456"). I checked the log with TensorBoard but it doesn't seem to contain any information about the latest epoch, so I leave the epoch at 0 (I assume that is purely cosmetic and only resets the epoch counter, so results shouldn't be affected, right?). I make a backup and run
    python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
    just like when I began training. It looks like it resumes from the iteration I pointed it to, but the results are far worse than before, even after 3 days of training.
  2. I revert train.py to its original state and begin training from scratch, but warm start from my latest model (mylatestmodel.pt) instead of the provided LibriTTS pretrained model (mellotron_libritts.pt) from this repo, so I run
    python train.py --output_directory=outdir --log_directory=logdir -c models/mylatestmodel.pt --warm_start
    but the generated results sound even worse than (1) after 2 days of training.

Is it actually possible to get an interrupted training run back on track? If so, what is the correct method? It is quite frustrating to lose days or hours of training because of an incident beyond our control.
I also think a console logger would be a useful addition: if the terminal window gets closed unexpectedly you would still have a record of the run. In my case it would have been useful to check how many epochs were reached before training was interrupted, even if that information is purely cosmetic.
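In the meantime, a workaround I could patch into train.py myself might look like the sketch below. It only uses Python's standard logging module; the console.log filename and the loop variables in the comment are made up for illustration.

    import logging
    import sys

    # Mirror everything that goes to the console into console.log as well,
    # so a record of the run survives even if the terminal window is closed.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(message)s",
        handlers=[logging.StreamHandler(sys.stdout),
                  logging.FileHandler("console.log")])

    # Then, inside the training loop, calls like:
    #   logging.info("epoch %d  iteration %d  loss %f", epoch, iteration, loss)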
@rafaelvalle
Contributor

By default a checkpoint is saved every 500 iterations.
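A small sketch for locating the newest checkpoint_<iteration> file after a crash, assuming the checkpoints are written to outdir as in the commands above (the path and parsing are illustrative, not the repo's exact layout):

    import glob
    import os

    # Checkpoints are named checkpoint_<iteration>; pick the one with the
    # highest iteration number to resume from.
    checkpoints = glob.glob(os.path.join("outdir", "checkpoint_*"))
    latest = max(checkpoints, key=lambda p: int(p.rsplit("_", 1)[-1]))
    print(latest)  # e.g. outdir/checkpoint_300000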

@AndroYD84
Author

AndroYD84 commented Jan 8, 2020

Yes, but that's not the issue. Imagine I trained for weeks, up to 300,400 iterations, and a blackout happened: I'd lose only 400 iterations of progress and still have a "checkpoint_300000" file. Is it possible to resume training from this checkpoint? Every attempt I've made to resume from a checkpoint has produced a model that sounds much worse than its predecessor (the "checkpoint_300000" file). I know that resuming training sometimes requires some warming up before it returns to its previous state, but that isn't happening even after a week; the results are not even close to the predecessor. If I had a time machine and could have prevented the blackout, the new checkpoint (i.e. checkpoint_400000) would have sounded better, not worse, than before. Do I have to start over from scratch and lose weeks of training, or did I do something wrong? Thanks for your patience.

@texpomru13
Contributor

@AndroYD84 try changing the hyperparameters before resuming from the checkpoint: ignore_layers=[] and use_saved_learning_rate=True
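For reference, a hedged sketch of what those two settings mean. In a Tacotron2/Mellotron-style hparams.py they would sit inside create_hparams(), so the snippet below is illustrative rather than the repo's exact code:

    # Illustrative only: the two hyperparameters mentioned above, with the
    # values suggested for resuming. Exact names/placement in hparams.py are
    # an assumption about this checkout.
    resume_hparams = dict(
        ignore_layers=[],              # drop no layers when loading the checkpoint
        use_saved_learning_rate=True,  # reuse the learning rate stored in the checkpoint
    )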

@rafaelvalle
Contributor

rafaelvalle commented Jan 8, 2020

Note that --warm_start does not load the optimizer state.
When resuming from your own model, you should not include --warm_start.
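In other words, the two code paths differ in what they restore. A minimal sketch, assuming a Tacotron2/Mellotron-style checkpoint dictionary (the key names are an assumption about this checkout):

    import torch

    def warm_start(checkpoint_path, model):
        # --warm_start: copies the model weights only; optimizer state,
        # learning rate and iteration counter all start from scratch.
        ckpt = torch.load(checkpoint_path, map_location='cpu')
        model.load_state_dict(ckpt['state_dict'])
        return model

    def resume(checkpoint_path, model, optimizer):
        # -c <checkpoint> without --warm_start: also restores the optimizer
        # state and the iteration counter, so training picks up where it stopped.
        ckpt = torch.load(checkpoint_path, map_location='cpu')
        model.load_state_dict(ckpt['state_dict'])
        optimizer.load_state_dict(ckpt['optimizer'])
        return model, optimizer, ckpt['learning_rate'], ckpt['iteration']

So to continue an interrupted run, drop the flag and point -c at your own checkpoint, for example
    python train.py --output_directory=outdir --log_directory=logdir -c outdir/checkpoint_300000
(adjust the checkpoint path to wherever your checkpoint_<iteration> files were written).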

@AndroYD84
Author

Thank you for your help. I resumed training without the --warm_start option and I can confirm that, so far, I haven't noticed any quality loss. I haven't tried texpomru13's solution, since the results were already improving without changing anything.
However, if at some point I notice the model is no longer improving, I plan to test that solution as well; right now I don't want to jinx it.
