
Cannot resume training without quality loss #30
Closed
AndroYD84 opened this issue Jan 6, 2020 · 5 comments
@AndroYD84

Due to unfortunate circumstances my training process terminated abruptly. Every attempt to resume it has produced a model that is much worse than before the interruption, and even after days of training it doesn't get back to its previous quality, as if the progress were unrecoverable. I tried two different methods:

  1. I edit this line in train.py with the latest checkpoint number (i.e. "123456" if the checkpoint filename is "checkpoint_123456"). I checked the log with TensorBoard but it doesn't seem to contain any information about the latest epoch, so I leave the epoch at 0 (I assume that is purely cosmetic and only resets the epoch counter, so results shouldn't be affected, right?). I make a backup and run
    python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
    just like when I began training. It looks like it resumes from the iteration I pointed it to, but the results are far worse than before, even after 3 days of training.
  2. I revert train.py to its original state and begin training from scratch, but warm start from my latest model (mylatestmodel.pt) instead of the provided LibriTTS pretrained model (mellotron_libritts.pt) from this repo, so I run
    python train.py --output_directory=outdir --log_directory=logdir -c models/mylatestmodel.pt --warm_start
    but the generated results sound even worse than (1) after 2 days of training.

Is it actually possible to get an interrupted training run back on track? If so, what is the correct method? It is quite frustrating to lose days or hours of training because of an incident beyond our control.
I also think a console logger would be a useful addition: if the terminal window gets closed unexpectedly you would still have a record of the run. In my case it would have been useful to check how many epochs were reached before training was interrupted, even if that information is purely cosmetic.
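In the meantime, a workaround I could patch into train.py myself might look like the sketch below. It only uses Python's standard logging module; the console.log filename and the loop variables in the comment are made up for illustration.

    import logging
    import sys

    # Mirror everything that goes to the console into console.log as well,
    # so a record of the run survives even if the terminal window is closed.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(message)s",
        handlers=[logging.StreamHandler(sys.stdout),
                  logging.FileHandler("console.log")])

    # Then, inside the training loop, calls like:
    #   logging.info("epoch %d  iteration %d  loss %f", epoch, iteration, loss)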
@rafaelvalle
Contributor

By default a checkpoint is saved every 500 iterations.
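A small sketch for locating the newest checkpoint_<iteration> file after a crash, assuming the checkpoints are written to outdir as in the commands above (the path and parsing are illustrative, not the repo's exact layout):

    import glob
    import os

    # Checkpoints are named checkpoint_<iteration>; pick the one with the
    # highest iteration number to resume from.
    checkpoints = glob.glob(os.path.join("outdir", "checkpoint_*"))
    latest = max(checkpoints, key=lambda p: int(p.rsplit("_", 1)[-1]))
    print(latest)  # e.g. outdir/checkpoint_300000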

@AndroYD84
Author

AndroYD84 commented Jan 8, 2020

Yes, but that's not the issue. Imagine I trained for weeks, up to 300,400 iterations, and a blackout happened: I'd lose only 400 iterations of progress and still have a "checkpoint_300000" file. Is it possible to resume training from this checkpoint? Every attempt I've made to resume from a checkpoint has produced a model that sounds much worse than its predecessor (the "checkpoint_300000" file). I know that resuming training sometimes requires some warming up before it returns to its previous state, but that isn't happening even after a week; the results are not even close to the predecessor. If I had a time machine and could have prevented the blackout, the new checkpoint (i.e. checkpoint_400000) would have sounded better, not worse, than before. Do I have to start over from scratch and lose weeks of training, or did I do something wrong? Thanks for your patience.

@texpomru13
Contributor

@AndroYD84 try changing the hyperparameters before resuming from the checkpoint: ignore_layers=[] and use_saved_learning_rate=True
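For reference, a hedged sketch of what those two settings mean. In a Tacotron2/Mellotron-style hparams.py they would sit inside create_hparams(), so the snippet below is illustrative rather than the repo's exact code:

    # Illustrative only: the two hyperparameters mentioned above, with the
    # values suggested for resuming. Exact names/placement in hparams.py are
    # an assumption about this checkout.
    resume_hparams = dict(
        ignore_layers=[],              # drop no layers when loading the checkpoint
        use_saved_learning_rate=True,  # reuse the learning rate stored in the checkpoint
    )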

@rafaelvalle
Contributor

rafaelvalle commented Jan 8, 2020

Note that --warm_start does not load the optimizer state.
When resuming from your own model, you should not include --warm_start.
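In other words, the two code paths differ in what they restore. A minimal sketch, assuming a Tacotron2/Mellotron-style checkpoint dictionary (the key names are an assumption about this checkout):

    import torch

    def warm_start(checkpoint_path, model):
        # --warm_start: copies the model weights only; optimizer state,
        # learning rate and iteration counter all start from scratch.
        ckpt = torch.load(checkpoint_path, map_location='cpu')
        model.load_state_dict(ckpt['state_dict'])
        return model

    def resume(checkpoint_path, model, optimizer):
        # -c <checkpoint> without --warm_start: also restores the optimizer
        # state and the iteration counter, so training picks up where it stopped.
        ckpt = torch.load(checkpoint_path, map_location='cpu')
        model.load_state_dict(ckpt['state_dict'])
        optimizer.load_state_dict(ckpt['optimizer'])
        return model, optimizer, ckpt['learning_rate'], ckpt['iteration']

So to continue an interrupted run, drop the flag and point -c at your own checkpoint, for example
    python train.py --output_directory=outdir --log_directory=logdir -c outdir/checkpoint_300000
(adjust the checkpoint path to wherever your checkpoint_<iteration> files were written).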

@AndroYD84
Author

Thank you for your help. I resumed training without the --warm_start option and I can confirm that, so far, I haven't noticed any quality loss. I haven't tried texpomru13's solution, since the results were already improving without changing anything.
However, if at some point I notice the model is no longer improving, I plan to test that solution as well; right now I don't want to jinx it.
