
Resuming training seems unstable #419

Closed
roedoejet opened this issue May 3, 2024 · 8 comments · Fixed by EveryVoiceTTS/FastSpeech2_lightning#82 or #473


roedoejet commented May 3, 2024

When I resume training, there's an initial spike in the loss that I don't expect:

[screenshot: training loss curves showing a spike immediately after resuming]

Hypothesis

  • Random seeds are not being properly initialized. Note that there are several seeds that need to be initialized.
  • The eval is calculated over only a sample of the validation set.
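The first hypothesis can be illustrated with Python's stdlib `random` module alone: if the RNG state is not captured in the checkpoint and restored on resume, the stream of random draws after resuming differs from an uninterrupted run. A minimal sketch of the principle (not EveryVoice code; in practice `torch`, `numpy`, and Python's `random` each keep their own state that would all need saving):

```python
import random

# Uninterrupted run: seed once, draw four numbers.
random.seed(1234)
uninterrupted = [random.random() for _ in range(4)]

# Interrupted run: draw two, "checkpoint" the RNG state, then resume.
random.seed(1234)
first_half = [random.random() for _ in range(2)]
checkpointed_state = random.getstate()  # what a checkpoint would need to save

# Resuming WITHOUT the saved state: re-seeding restarts the stream.
random.seed(1234)
wrong_resume = [random.random() for _ in range(2)]

# Resuming WITH the saved state continues the stream exactly.
random.setstate(checkpointed_state)
right_resume = [random.random() for _ in range(2)]

assert first_half + right_resume == uninterrupted
assert first_half + wrong_resume != uninterrupted
```

If only the seed (or nothing at all) is restored, every run after resuming replays or diverges from the original stream, which could perturb anything randomized per step (dropout, batch order, sampling).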
@roedoejet roedoejet added bug Something isn't working help wanted Extra attention is needed labels May 3, 2024
@roedoejet roedoejet added this to the alpha milestone May 3, 2024
@roedoejet (Member Author)

@SamuelLarkin says: "maybe it's a random seed that does not get saved in the checkpoint"

@SamuelLarkin SamuelLarkin self-assigned this May 10, 2024

SamuelLarkin commented Jun 4, 2024

What are the losses' values after the initial fitting, and what are the same losses just before resuming?
We can see that they are all the same except for validation/attn_bin_loss.

Post fitting
DebugCallback::on_validation_start(ce=3, gs=270)
DebugCallback::on_validation_epoch_start(ce=3, gs=270)
DebugCallback::on_validation_epoch_end(ce=3, gs=270)
DebugCallback::on_validation_end(ce=3, gs=270)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ validation/attn_bin_loss  │   0.005417258013039827    │
│ validation/attn_ctc_loss  │    0.4814768135547638     │
│ validation/duration_loss  │    0.08266130089759827    │
│  validation/energy_loss   │    0.08383136242628098    │
│   validation/pitch_loss   │    0.10729385912418365    │
│   validation/spec_loss    │     2.699561834335327     │
│   validation/total_loss   │    3.4602415561676025     │
└───────────────────────────┴───────────────────────────┘
Pre fitting
DebugCallback::on_validation_start(ce=0, gs=0)
DebugCallback::on_validation_epoch_start(ce=0, gs=0)
DebugCallback::on_validation_epoch_end(ce=0, gs=0)
DebugCallback::on_validation_end(ce=0, gs=0)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ validation/attn_bin_loss  │            0.0            │
│ validation/attn_ctc_loss  │    0.4814768135547638     │
│ validation/duration_loss  │    0.08266130089759827    │
│  validation/energy_loss   │    0.08383136242628098    │
│   validation/pitch_loss   │    0.10729385912418365    │
│   validation/spec_loss    │     2.699561834335327     │
│   validation/total_loss   │     3.454824209213257     │
└───────────────────────────┴───────────────────────────┘
DebugCallback::on_train_start(ce=2, gs=270)
DebugCallback::on_train_epoch_start(ce=2, gs=270)
DebugCallback::on_validation_start(ce=2, gs=270)
DebugCallback::on_validation_epoch_start(ce=2, gs=270)
DebugCallback::on_validation_epoch_end(ce=2, gs=270)
DebugCallback::on_validation_end(ce=2, gs=270)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ validation/attn_bin_loss  │   0.003311760723590851    │
│ validation/attn_ctc_loss  │    0.4455091953277588     │
│ validation/duration_loss  │    0.07343553006649017    │
│  validation/energy_loss   │    0.05731743574142456    │
│   validation/pitch_loss   │   0.046959176659584045    │
│   validation/spec_loss    │    2.4098148345947266     │
│   validation/total_loss   │    3.0363481044769287     │
└───────────────────────────┴───────────────────────────┘


SamuelLarkin commented Jun 4, 2024

Curiosities

TL;DR: a "weights only" checkpoint is not just the model's weights.

In [8]: m.keys()
Out[8]: dict_keys(['epoch', 'global_step', 'pytorch-lightning_version', 'state_dict', 'loops', 'hparams_name', 'hyper_parameters'])

What are we looking at?
We've created a custom Callback that saves the model's weights only on certain events.
The names of the files are built from:

  • ce stands for current epoch
  • gs stands for global step
  • _wo weights only
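For reference, the naming scheme above can be captured in a tiny helper (hypothetical; the actual dump-writing code is not shown in this thread):

```python
def weights_only_filename(ce: int, gs: int, event: str) -> str:
    """Build a weights-only dump name: ce = current epoch,
    gs = global step, _wo = weights only."""
    return f"ce{ce}.gs{gs}.{event}_wo"

# e.g. the file written when fitting ends at epoch 3, global step 270:
print(weights_only_filename(3, 270, "on_fit_end"))  # ce3.gs270.on_fit_end_wo
```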

We run EveryVoice twice: once for the initial fitting of 3 epochs, followed by a trainer.validate().
Then we resume training, calling trainer.validate() before and after trainer.fit().
Resuming should start at epoch 4.

We can see that we only get on_fit_X events for current epoch 3 at global step 270, and that there is no on_fit_X event when we resume at epoch 4.

lsd ce3*_wo ce4*_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:10 2024  ce3.gs270.on_fit_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:09 2024  ce3.gs270.on_train_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:57 2024  ce3.gs270.on_train_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:12 2024  ce3.gs270.on_validation_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:12 2024  ce3.gs270.on_validation_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:11 2024  ce3.gs270.on_validation_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:10 2024  ce3.gs270.on_validation_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:11 2024  ce3.gs360.on_train_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:10 2024  ce3.gs360.on_validation_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:10 2024  ce3.gs360.on_validation_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:08 2024  ce3.gs360.on_validation_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:07 2024  ce3.gs360.on_validation_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:11 2024  ce4.gs360.on_train_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:23 2024  ce4.gs450.on_train_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:22 2024  ce4.gs450.on_validation_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:22 2024  ce4.gs450.on_validation_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:20 2024  ce4.gs450.on_validation_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:20 2024  ce4.gs450.on_validation_start_wo

Why are there both ce3.gs270 and ce3.gs360 files?
Shouldn't there only be ce3.gsX?

We would also expect that, when resuming, the weights-only files would be identical between epoch 3's on_X_end and epoch 4's on_X_start.

sha1sum  ce3.gs270.on_fit_end_wo ce3.gs270.on_train_end_wo ce3.gs270.on_validation_end_wo ce3.gs270.on_validation_epoch_end_wo ce3.gs360.on_train_epoch_end_wo ce3.gs360.on_validation_end_wo ce3.gs360.on_validation_epoch_end_wo ce4.gs360.on_train_epoch_start_wo ce4.gs450.on_train_epoch_end_wo ce4.gs450.on_validation_epoch_start_wo ce4.gs450.on_validation_start_wo

682c060aff4fa606749e17715e560460da3531ed ce3.gs270.on_fit_end_wo
2b3f0348d04b68191dfa27f4b39450b61da38213 ce3.gs270.on_train_end_wo
7da2f008ae5f3354c6d018ec384174e90e04c99c ce3.gs270.on_validation_end_wo
7da2f008ae5f3354c6d018ec384174e90e04c99c ce3.gs270.on_validation_epoch_end_wo
c9dc67a4e0a2ec64e319013d6458b8fb745852e7 ce3.gs360.on_train_epoch_end_wo
79583edd83f130303351decf4f872db2a4cc4cc3 ce3.gs360.on_validation_end_wo
79583edd83f130303351decf4f872db2a4cc4cc3 ce3.gs360.on_validation_epoch_end_wo
e2a1f1992bb6d2a572e8573b749dd9bd945d7812 ce4.gs360.on_train_epoch_start_wo
186d232b164db473059e30310704e49a832176a3 ce4.gs450.on_train_epoch_end_wo
714da2b8fdcce68a5db4ab84f19914adbd4e5cce ce4.gs450.on_validation_epoch_start_wo
714da2b8fdcce68a5db4ab84f19914adbd4e5cce ce4.gs450.on_validation_start_wo


SamuelLarkin commented Jun 5, 2024

TL;DR: the weights are the same at the end of training from scratch and at the beginning of resuming.

Are the weights changing unexpectedly, or are the fluctuations solely due to current_epoch and global_step being out of whack?

Given that we save the model during training on different triggered events, let's make sure the model's weights are identical for events where they should be the same.
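One way to compare dumps, as done below with checksums, is to hash a canonical serialization of the state dict. A stdlib-only sketch of the idea (real checkpoints would hash the torch-saved file itself, as sha1sum/sha256sum do below, or each tensor's raw bytes):

```python
import hashlib
import pickle

def state_dict_digest(state_dict: dict) -> str:
    """Hash a deterministic serialization of a plain-Python state dict."""
    payload = pickle.dumps(sorted(state_dict.items()))
    return hashlib.sha256(payload).hexdigest()

# Two dumps taken at events with no optimizer step in between should
# hash identically; any weight update changes the digest.
before = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
after_noop = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
after_step = {"layer.weight": [0.1, 0.25], "layer.bias": [0.0]}

assert state_dict_digest(before) == state_dict_digest(after_noop)
assert state_dict_digest(before) != state_dict_digest(after_step)
```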

From scratch

5cfe941437647e2ef1d254dfc53e3c772972f44d32dd1ab881ee76bd6fadc6da on/004.ce0.gs0.on_train_start_wo
5cfe941437647e2ef1d254dfc53e3c772972f44d32dd1ab881ee76bd6fadc6da on/005.ce0.gs0.on_train_epoch_start_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/006.ce0.gs90.on_validation_start_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/007.ce0.gs90.on_validation_epoch_start_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/008.ce0.gs90.on_validation_epoch_end_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/009.ce0.gs90.on_validation_end_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/010.ce0.gs90.on_train_epoch_end_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/011.ce1.gs90.on_train_epoch_start_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/012.ce1.gs180.on_validation_start_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/013.ce1.gs180.on_validation_epoch_start_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/014.ce1.gs180.on_validation_epoch_end_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/015.ce1.gs180.on_validation_end_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/016.ce1.gs180.on_train_epoch_end_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/017.ce2.gs180.on_train_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/018.ce2.gs270.on_validation_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/019.ce2.gs270.on_validation_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/020.ce2.gs270.on_validation_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/021.ce2.gs270.on_validation_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/022.ce2.gs270.on_train_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/023.ce3.gs270.on_train_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/024.ce3.gs270.on_fit_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/025.ce3.gs270.on_validation_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/026.ce3.gs270.on_validation_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/027.ce3.gs270.on_validation_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/028.ce3.gs270.on_validation_end_wo

When resuming

3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/000.ce0.gs0.on_validation_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/001.ce0.gs0.on_validation_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/002.ce0.gs0.on_validation_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/003.ce0.gs0.on_validation_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/004.ce2.gs270.on_train_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/005.ce2.gs270.on_train_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/006.ce2.gs270.on_validation_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/007.ce2.gs270.on_validation_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/008.ce2.gs270.on_validation_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/009.ce2.gs270.on_validation_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/010.ce2.gs270.on_train_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/011.ce3.gs270.on_train_epoch_start_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/012.ce3.gs360.on_validation_start_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/013.ce3.gs360.on_validation_epoch_start_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/014.ce3.gs360.on_validation_epoch_end_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/015.ce3.gs360.on_validation_end_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/016.ce3.gs360.on_train_epoch_end_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/017.ce4.gs360.on_train_epoch_start_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/018.ce4.gs450.on_validation_start_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/019.ce4.gs450.on_validation_epoch_start_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/020.ce4.gs450.on_validation_epoch_end_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/021.ce4.gs450.on_validation_end_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/022.ce4.gs450.on_train_epoch_end_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/023.ce5.gs450.on_train_epoch_start_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/024.ce5.gs540.on_validation_start_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/025.ce5.gs540.on_validation_epoch_start_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/026.ce5.gs540.on_validation_epoch_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/027.ce5.gs540.on_validation_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/028.ce5.gs540.on_train_epoch_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/029.ce6.gs540.on_train_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/030.ce6.gs540.on_fit_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/031.ce6.gs540.on_validation_start_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/032.ce6.gs540.on_validation_epoch_start_wo

Checkpoints

23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=0-step=90.ckpt
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=1-step=180.ckpt
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=2-step=270.ckpt
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=3-step=360.ckpt
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=4-step=450.ckpt
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=5-step=540-v1.ckpt
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=5-step=540.ckpt
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt

@SamuelLarkin

2024-06-06 lead
It looks like when training restarts, a validation loop is triggered before training. Instead of validating over all validation examples, it uses only one example. Entering the validation loop, the dataloader has already fetched=55 while its length=50, so the very next read flips the dataloader's state to done. Why remains a mystery for now.
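That symptom can be mimicked with a toy fetcher: if a stale fetched counter at or past the dataset length is restored from the checkpoint, the loop believes it is done after a single read. A simplified sketch of the behaviour described above (not Lightning's actual data fetcher):

```python
class ToyFetcher:
    """Minimal stand-in for a prefetching data fetcher."""

    def __init__(self, length: int, fetched: int = 0):
        self.length = length
        self.fetched = fetched  # restored from a checkpoint on resume

    def batches(self):
        for batch in range(self.length):
            self.fetched += 1
            yield batch
            if self.fetched >= self.length:  # "done" check uses the counter
                return

# Fresh run: all 50 validation batches are seen.
assert len(list(ToyFetcher(length=50).batches())) == 50

# Resumed run with a stale counter (fetched=55 > length=50): the very
# first read trips the done check, so only 1 batch is validated.
assert len(list(ToyFetcher(length=50, fetched=55).batches())) == 1
```

Validating on a single example instead of the full set would explain a loss value that jumps relative to the pre-resume validation.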

@roedoejet roedoejet modified the milestones: alpha, beta Jun 10, 2024
@SamuelLarkin

2024-06-11
If we reset the validation dataloader's progress before saving it to a checkpoint, we get smooth resuming for all losses except the binary attention loss.
[screenshot: loss curves resuming smoothly, except for attn_bin_loss]

from typing import Any, Dict

import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback
from typing_extensions import override


class ResetValidationDataloaderCallback(Callback):
    @override
    def on_save_checkpoint(
        self,
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        checkpoint: Dict[str, Any],
    ) -> None:
        # Reset the validation progress so that resuming validates the
        # full validation set, not just the first example in it.
        batch_progress = trainer.fit_loop.epoch_loop.val_loop.batch_progress
        batch_progress.reset()

Unfortunately, this doesn't help with attn_bin_loss, because that loss depends on current_epoch, which isn't properly reloaded.
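Why current_epoch matters here: binarization losses are commonly ramped up over the first epochs, so if current_epoch is restored differently from where training left off, the loss weight (and hence the reported loss) jumps. A hypothetical linear ramp, not EveryVoice's actual schedule:

```python
def attn_bin_loss_weight(current_epoch: int, warmup_epochs: int = 10) -> float:
    """Linearly ramp the binarization loss weight over the first epochs."""
    return min(current_epoch / warmup_epochs, 1.0)

# Uninterrupted training at epoch 3:
w_expected = attn_bin_loss_weight(3)   # 0.3
# If resuming wrongly restores current_epoch = 0, the weighted loss
# collapses to 0 -- consistent with the attn_bin_loss = 0.0 seen in
# the pre-fitting validation table above.
w_bad_resume = attn_bin_loss_weight(0)  # 0.0
```

Under such a schedule, a mis-restored epoch counter alone is enough to make attn_bin_loss diverge even when the weights are bit-identical.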

@SamuelLarkin

Notes
Why is this done?

From Lightning's _EvaluationLoop.reset() (evaluation_loop.py:229):
# add the previous `fetched` value to properly track `is_last_batch` with no prefetching
data_fetcher.fetched += self.batch_progress.current.ready
