
Resuming training seems unstable #419

Closed
roedoejet opened this issue May 3, 2024 · 8 comments · Fixed by EveryVoiceTTS/FastSpeech2_lightning#82 or #473


roedoejet commented May 3, 2024

When I resume training, there's an initial spike in the loss that I don't expect:

[screenshot: training loss curves showing a spike immediately after resuming]

Hypothesis

  • Random seeds are not being properly initialized. Note that there are several seeds that need to be initialized.
  • The eval is calculated over only a sample of the validation set.
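The first hypothesis can be illustrated with Python's stdlib `random` module alone: if the RNG state is not captured in the checkpoint and restored on resume, the stream of random draws after resuming differs from an uninterrupted run. A minimal sketch of the principle (not EveryVoice code; in practice `torch`, `numpy`, and Python's `random` each keep their own state that would all need saving):

```python
import random

# Uninterrupted run: seed once, draw four numbers.
random.seed(1234)
uninterrupted = [random.random() for _ in range(4)]

# Interrupted run: draw two, "checkpoint" the RNG state, then resume.
random.seed(1234)
first_half = [random.random() for _ in range(2)]
checkpointed_state = random.getstate()  # what a checkpoint would need to save

# Resuming WITHOUT the saved state: re-seeding restarts the stream.
random.seed(1234)
wrong_resume = [random.random() for _ in range(2)]

# Resuming WITH the saved state continues the stream exactly.
random.setstate(checkpointed_state)
right_resume = [random.random() for _ in range(2)]

assert first_half + right_resume == uninterrupted
assert first_half + wrong_resume != uninterrupted
```

If only the seed (or nothing at all) is restored, every run after resuming replays or diverges from the original stream, which could perturb anything randomized per step (dropout, batch order, sampling).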
@roedoejet roedoejet added bug Something isn't working help wanted Extra attention is needed labels May 3, 2024
@roedoejet roedoejet added this to the alpha milestone May 3, 2024
@roedoejet (Member Author)

@SamuelLarkin says: "maybe it's a random seed that does not get saved in the checkpoint"

@SamuelLarkin SamuelLarkin self-assigned this May 10, 2024

SamuelLarkin commented Jun 4, 2024

What are the losses' values after the initial fitting, and what are the same losses just before resuming?
We can see that they are all the same except for validation/attn_bin_loss.

Post fitting
DebugCallback::on_validation_start(ce=3, gs=270)
DebugCallback::on_validation_epoch_start(ce=3, gs=270)
DebugCallback::on_validation_epoch_end(ce=3, gs=270)
DebugCallback::on_validation_end(ce=3, gs=270)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ validation/attn_bin_loss  │   0.005417258013039827    │
│ validation/attn_ctc_loss  │    0.4814768135547638     │
│ validation/duration_loss  │    0.08266130089759827    │
│  validation/energy_loss   │    0.08383136242628098    │
│   validation/pitch_loss   │    0.10729385912418365    │
│   validation/spec_loss    │     2.699561834335327     │
│   validation/total_loss   │    3.4602415561676025     │
└───────────────────────────┴───────────────────────────┘
Pre fitting
DebugCallback::on_validation_start(ce=0, gs=0)
DebugCallback::on_validation_epoch_start(ce=0, gs=0)
DebugCallback::on_validation_epoch_end(ce=0, gs=0)
DebugCallback::on_validation_end(ce=0, gs=0)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ validation/attn_bin_loss  │            0.0            │
│ validation/attn_ctc_loss  │    0.4814768135547638     │
│ validation/duration_loss  │    0.08266130089759827    │
│  validation/energy_loss   │    0.08383136242628098    │
│   validation/pitch_loss   │    0.10729385912418365    │
│   validation/spec_loss    │     2.699561834335327     │
│   validation/total_loss   │     3.454824209213257     │
└───────────────────────────┴───────────────────────────┘
DebugCallback::on_train_start(ce=2, gs=270)
DebugCallback::on_train_epoch_start(ce=2, gs=270)
DebugCallback::on_validation_start(ce=2, gs=270)
DebugCallback::on_validation_epoch_start(ce=2, gs=270)
DebugCallback::on_validation_epoch_end(ce=2, gs=270)
DebugCallback::on_validation_end(ce=2, gs=270)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ validation/attn_bin_loss  │   0.003311760723590851    │
│ validation/attn_ctc_loss  │    0.4455091953277588     │
│ validation/duration_loss  │    0.07343553006649017    │
│  validation/energy_loss   │    0.05731743574142456    │
│   validation/pitch_loss   │   0.046959176659584045    │
│   validation/spec_loss    │    2.4098148345947266     │
│   validation/total_loss   │    3.0363481044769287     │
└───────────────────────────┴───────────────────────────┘


SamuelLarkin commented Jun 4, 2024

Curiosities

TL;DR: a "weights only" checkpoint is not just the model's weights.

In [8]: m.keys()
Out[8]: dict_keys(['epoch', 'global_step', 'pytorch-lightning_version', 'state_dict', 'loops', 'hparams_name', 'hyper_parameters'])

What are we looking at?
We've created a custom Callback that saves the model's weights only on certain events.
The names of the files are built from:

  • ce stands for current epoch
  • gs stands for global step
  • _wo weights only
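For reference, the naming scheme above can be captured in a tiny helper (hypothetical; the actual dump-writing code is not shown in this thread):

```python
def weights_only_filename(ce: int, gs: int, event: str) -> str:
    """Build a weights-only dump name: ce = current epoch,
    gs = global step, _wo = weights only."""
    return f"ce{ce}.gs{gs}.{event}_wo"

# e.g. the file written when fitting ends at epoch 3, global step 270:
print(weights_only_filename(3, 270, "on_fit_end"))  # ce3.gs270.on_fit_end_wo
```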

We run EveryVoice twice: once for the initial fitting of 3 epochs, followed by a trainer.validate().
Then we resume training, calling trainer.validate() before and after trainer.fit().
Resuming should start at epoch 4.

We can see that we only get on_fit_X events for current epoch 3 at global step 270, and that there is no on_fit_X event when we resume at epoch 4.

lsd ce3*_wo ce4*_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:10 2024  ce3.gs270.on_fit_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:09 2024  ce3.gs270.on_train_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:57 2024  ce3.gs270.on_train_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:12 2024  ce3.gs270.on_validation_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:12 2024  ce3.gs270.on_validation_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:11 2024  ce3.gs270.on_validation_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:19:10 2024  ce3.gs270.on_validation_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:11 2024  ce3.gs360.on_train_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:10 2024  ce3.gs360.on_validation_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:10 2024  ce3.gs360.on_validation_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:08 2024  ce3.gs360.on_validation_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:07 2024  ce3.gs360.on_validation_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:11 2024  ce4.gs360.on_train_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:23 2024  ce4.gs450.on_train_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:22 2024  ce4.gs450.on_validation_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:22 2024  ce4.gs450.on_validation_epoch_end_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:20 2024  ce4.gs450.on_validation_epoch_start_wo
.rw-rw---- sam037 nrc_ict 53 MB Tue Jun  4 09:20:20 2024  ce4.gs450.on_validation_start_wo

Why are there both ce3.gs270 and ce3.gs360 files?
Shouldn't there only be ce3.gsX?

We would also expect that, when resuming, the weights-only files would be identical between epoch 3's on_X_end and epoch 4's on_X_start.

sha1sum  ce3.gs270.on_fit_end_wo ce3.gs270.on_train_end_wo ce3.gs270.on_validation_end_wo ce3.gs270.on_validation_epoch_end_wo ce3.gs360.on_train_epoch_end_wo ce3.gs360.on_validation_end_wo ce3.gs360.on_validation_epoch_end_wo ce4.gs360.on_train_epoch_start_wo ce4.gs450.on_train_epoch_end_wo ce4.gs450.on_validation_epoch_start_wo ce4.gs450.on_validation_start_wo

682c060aff4fa606749e17715e560460da3531ed ce3.gs270.on_fit_end_wo
2b3f0348d04b68191dfa27f4b39450b61da38213 ce3.gs270.on_train_end_wo
7da2f008ae5f3354c6d018ec384174e90e04c99c ce3.gs270.on_validation_end_wo
7da2f008ae5f3354c6d018ec384174e90e04c99c ce3.gs270.on_validation_epoch_end_wo
c9dc67a4e0a2ec64e319013d6458b8fb745852e7 ce3.gs360.on_train_epoch_end_wo
79583edd83f130303351decf4f872db2a4cc4cc3 ce3.gs360.on_validation_end_wo
79583edd83f130303351decf4f872db2a4cc4cc3 ce3.gs360.on_validation_epoch_end_wo
e2a1f1992bb6d2a572e8573b749dd9bd945d7812 ce4.gs360.on_train_epoch_start_wo
186d232b164db473059e30310704e49a832176a3 ce4.gs450.on_train_epoch_end_wo
714da2b8fdcce68a5db4ab84f19914adbd4e5cce ce4.gs450.on_validation_epoch_start_wo
714da2b8fdcce68a5db4ab84f19914adbd4e5cce ce4.gs450.on_validation_start_wo


SamuelLarkin commented Jun 5, 2024

TL;DR: the weights are the same at the end of training from scratch and at the beginning of resuming.

Are the weights changing unexpectedly, or are the fluctuations solely due to current_epoch and global_step being out of whack?

Given that we save the model during training on different triggered events, let's make sure the model's weights are identical for events where they should be the same.
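One way to compare dumps, as done below with checksums, is to hash a canonical serialization of the state dict. A stdlib-only sketch of the idea (real checkpoints would hash the torch-saved file itself, as sha1sum/sha256sum do below, or each tensor's raw bytes):

```python
import hashlib
import pickle

def state_dict_digest(state_dict: dict) -> str:
    """Hash a deterministic serialization of a plain-Python state dict."""
    payload = pickle.dumps(sorted(state_dict.items()))
    return hashlib.sha256(payload).hexdigest()

# Two dumps taken at events with no optimizer step in between should
# hash identically; any weight update changes the digest.
before = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
after_noop = {"layer.weight": [0.1, 0.2], "layer.bias": [0.0]}
after_step = {"layer.weight": [0.1, 0.25], "layer.bias": [0.0]}

assert state_dict_digest(before) == state_dict_digest(after_noop)
assert state_dict_digest(before) != state_dict_digest(after_step)
```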

From scratch

5cfe941437647e2ef1d254dfc53e3c772972f44d32dd1ab881ee76bd6fadc6da on/004.ce0.gs0.on_train_start_wo
5cfe941437647e2ef1d254dfc53e3c772972f44d32dd1ab881ee76bd6fadc6da on/005.ce0.gs0.on_train_epoch_start_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/006.ce0.gs90.on_validation_start_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/007.ce0.gs90.on_validation_epoch_start_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/008.ce0.gs90.on_validation_epoch_end_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/009.ce0.gs90.on_validation_end_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/010.ce0.gs90.on_train_epoch_end_wo
23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 on/011.ce1.gs90.on_train_epoch_start_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/012.ce1.gs180.on_validation_start_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/013.ce1.gs180.on_validation_epoch_start_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/014.ce1.gs180.on_validation_epoch_end_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/015.ce1.gs180.on_validation_end_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/016.ce1.gs180.on_train_epoch_end_wo
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 on/017.ce2.gs180.on_train_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/018.ce2.gs270.on_validation_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/019.ce2.gs270.on_validation_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/020.ce2.gs270.on_validation_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/021.ce2.gs270.on_validation_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/022.ce2.gs270.on_train_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/023.ce3.gs270.on_train_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/024.ce3.gs270.on_fit_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/025.ce3.gs270.on_validation_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/026.ce3.gs270.on_validation_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/027.ce3.gs270.on_validation_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/028.ce3.gs270.on_validation_end_wo

When resuming

3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/000.ce0.gs0.on_validation_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/001.ce0.gs0.on_validation_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/002.ce0.gs0.on_validation_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/003.ce0.gs0.on_validation_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/004.ce2.gs270.on_train_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/005.ce2.gs270.on_train_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/006.ce2.gs270.on_validation_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/007.ce2.gs270.on_validation_epoch_start_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/008.ce2.gs270.on_validation_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/009.ce2.gs270.on_validation_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/010.ce2.gs270.on_train_epoch_end_wo
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 on/011.ce3.gs270.on_train_epoch_start_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/012.ce3.gs360.on_validation_start_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/013.ce3.gs360.on_validation_epoch_start_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/014.ce3.gs360.on_validation_epoch_end_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/015.ce3.gs360.on_validation_end_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/016.ce3.gs360.on_train_epoch_end_wo
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 on/017.ce4.gs360.on_train_epoch_start_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/018.ce4.gs450.on_validation_start_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/019.ce4.gs450.on_validation_epoch_start_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/020.ce4.gs450.on_validation_epoch_end_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/021.ce4.gs450.on_validation_end_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/022.ce4.gs450.on_train_epoch_end_wo
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e on/023.ce5.gs450.on_train_epoch_start_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/024.ce5.gs540.on_validation_start_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/025.ce5.gs540.on_validation_epoch_start_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/026.ce5.gs540.on_validation_epoch_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/027.ce5.gs540.on_validation_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/028.ce5.gs540.on_train_epoch_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/029.ce6.gs540.on_train_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/030.ce6.gs540.on_fit_end_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/031.ce6.gs540.on_validation_start_wo
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 on/032.ce6.gs540.on_validation_epoch_start_wo

Checkpoints

23d37b62c88b1a0d728d954e9f4443bc2aefc4156587038324d3275eed34e337 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=0-step=90.ckpt
f416717746ca66972a53c1a81a2d5fc5bab22498317a86ce68e099a88f13dd55 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=1-step=180.ckpt
3481750e0ed72c6078a9b6045266f132ba8990f62a3c153c95fe049496f5a295 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=2-step=270.ckpt
54f1a34cb537a08d0109952e8ec1a785732e610246ec92d198f69eeda3d95e11 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=3-step=360.ckpt
852c45af4269dd93e65b01f7eae1e5e1e6890ede0a81008c695dda115fc8c12e logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=4-step=450.ckpt
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=5-step=540-v1.ckpt
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=5-step=540.ckpt
cac0bedd6ed2b69cb531b160e98087a268e6429af56697e4a0e05d3223cc5584 logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt

@SamuelLarkin

2024-06-06 lead
It looks like when training restarts, a validation loop is triggered before training. Instead of validating over all validation examples, it uses only one example. Entering the validation loop, the dataloader has already fetched=55 while its length=50, so the very next read flips the dataloader's state to done. Why remains a mystery for now.
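That symptom can be mimicked with a toy fetcher: if a stale fetched counter at or past the dataset length is restored from the checkpoint, the loop believes it is done after a single read. A simplified sketch of the behaviour described above (not Lightning's actual data fetcher):

```python
class ToyFetcher:
    """Minimal stand-in for a prefetching data fetcher."""

    def __init__(self, length: int, fetched: int = 0):
        self.length = length
        self.fetched = fetched  # restored from a checkpoint on resume

    def batches(self):
        for batch in range(self.length):
            self.fetched += 1
            yield batch
            if self.fetched >= self.length:  # "done" check uses the counter
                return

# Fresh run: all 50 validation batches are seen.
assert len(list(ToyFetcher(length=50).batches())) == 50

# Resumed run with a stale counter (fetched=55 > length=50): the very
# first read trips the done check, so only 1 batch is validated.
assert len(list(ToyFetcher(length=50, fetched=55).batches())) == 1
```

Validating on a single example instead of the full set would explain a loss value that jumps relative to the pre-resume validation.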

@roedoejet roedoejet modified the milestones: alpha, beta Jun 10, 2024
@SamuelLarkin

2024-06-11
If we reset the validation dataloader's progress before saving it to a checkpoint, we get smooth resuming for all losses except the binary attention loss.
[screenshot: loss curves resuming smoothly, except for attn_bin_loss]

from typing import Any, Dict

import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback
from typing_extensions import override


class ResetValidationDataloaderCallback(Callback):
    @override
    def on_save_checkpoint(
        self,
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        checkpoint: Dict[str, Any],
    ) -> None:
        # Reset the validation progress so that resuming validates the
        # full validation set, not just the first example in it.
        batch_progress = trainer.fit_loop.epoch_loop.val_loop.batch_progress
        batch_progress.reset()

Unfortunately, this doesn't help with attn_bin_loss, because that loss depends on current_epoch, which isn't properly reloaded.
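Why current_epoch matters here: binarization losses are commonly ramped up over the first epochs, so if current_epoch is restored differently from where training left off, the loss weight (and hence the reported loss) jumps. A hypothetical linear ramp, not EveryVoice's actual schedule:

```python
def attn_bin_loss_weight(current_epoch: int, warmup_epochs: int = 10) -> float:
    """Linearly ramp the binarization loss weight over the first epochs."""
    return min(current_epoch / warmup_epochs, 1.0)

# Uninterrupted training at epoch 3:
w_expected = attn_bin_loss_weight(3)   # 0.3
# If resuming wrongly restores current_epoch = 0, the weighted loss
# collapses to 0 -- consistent with the attn_bin_loss = 0.0 seen in
# the pre-fitting validation table above.
w_bad_resume = attn_bin_loss_weight(0)  # 0.0
```

Under such a schedule, a mis-restored epoch counter alone is enough to make attn_bin_loss diverge even when the weights are bit-identical.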

@SamuelLarkin

Notes
Why is this done?

From Lightning's _EvaluationLoop.reset() (evaluation_loop.py:229):
# add the previous `fetched` value to properly track `is_last_batch` with no prefetching
data_fetcher.fetched += self.batch_progress.current.ready
