Resuming training seems unstable #419
@SamuelLarkin says: "maybe it's a random seed that does not get saved in the checkpoint"
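If that hypothesis is right, one way to test it would be to persist and restore the RNG state through the checkpoint. Below is a minimal sketch using PyTorch Lightning's `on_save_checkpoint`/`on_load_checkpoint` hooks; the `rng_state` key and the class name are illustrative, not EveryVoice's actual code.

```python
import torch
import pytorch_lightning as pl


class RNGStateMixin(pl.LightningModule):
    """Illustrative sketch: store RNG state in the checkpoint so a resumed run
    sees the same randomness (dropout, torch-based shuffling, ...) as an
    uninterrupted run would."""

    def on_save_checkpoint(self, checkpoint: dict) -> None:
        # "rng_state" is a hypothetical key, not something EveryVoice writes today.
        checkpoint["rng_state"] = {
            "torch": torch.get_rng_state(),
            "cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
        }

    def on_load_checkpoint(self, checkpoint: dict) -> None:
        state = checkpoint.get("rng_state")
        if state is None:
            return  # older checkpoint without RNG state
        torch.set_rng_state(state["torch"])
        if state["cuda"] is not None and torch.cuda.is_available():
            torch.cuda.set_rng_state_all(state["cuda"])
```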
Notes:
What are the losses' values after the initial fitting, and what are the same losses just before resuming?
Curiosities
TL;DR: "weights only" is not just the model's weights.
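One quick way to see this is to load a checkpoint with plain `torch.load` and inspect its top-level keys; a sketch, with a placeholder checkpoint path:

```python
import torch

# Placeholder path: any checkpoint produced by these runs, including one saved
# with trainer.save_checkpoint(path, weights_only=True).
# Note: torch.load's weights_only flag (unpickling safety) is unrelated to
# Lightning's weights_only checkpoints.
ckpt = torch.load("last.ckpt", map_location="cpu", weights_only=False)

# Even a "weights only" checkpoint typically carries bookkeeping (epoch,
# global_step, Lightning version, hyper-parameters, ...) alongside the weights.
print(sorted(ckpt.keys()))

# The model's weights proper are a single entry:
state_dict = ckpt["state_dict"]
print(f"{len(state_dict)} parameter tensors in state_dict")
```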
What are we looking at?
We run EveryVoice twice, once as the initial fitting for 3 epochs, followed by a second run that resumes from that checkpoint. We can see that we only get the weights-only dumps matched by `ce3*_wo ce4*_wo` (listed with `lsd`).
Why is there […]? We would also expect that, when resuming, the weights-only files would be identical between epoch 3's end-of-epoch events and epoch 4's start-of-epoch event:

```
sha1sum \
  ce3.gs270.on_fit_end_wo \
  ce3.gs270.on_train_end_wo \
  ce3.gs270.on_validation_end_wo \
  ce3.gs270.on_validation_epoch_end_wo \
  ce3.gs360.on_train_epoch_end_wo \
  ce3.gs360.on_validation_end_wo \
  ce3.gs360.on_validation_epoch_end_wo \
  ce4.gs360.on_train_epoch_start_wo \
  ce4.gs450.on_train_epoch_end_wo \
  ce4.gs450.on_validation_epoch_start_wo \
  ce4.gs450.on_validation_start_wo
682c060aff4fa606749e17715e560460da3531ed  ce3.gs270.on_fit_end_wo
```
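Since `sha1sum` compares serialized bytes, a complementary check is to load two of the dumps and compare them tensor by tensor; a sketch, assuming each `*_wo` file was written with `torch.save(state_dict, path)`:

```python
import torch

# Two dumps that should hold the same weights: end of epoch 3 vs. start of epoch 4.
a = torch.load("ce3.gs360.on_train_epoch_end_wo", map_location="cpu")
b = torch.load("ce4.gs360.on_train_epoch_start_wo", map_location="cpu")

assert a.keys() == b.keys(), "the two dumps do not contain the same parameter names"

mismatched = [name for name in a if not torch.equal(a[name], b[name])]
if mismatched:
    print(f"{len(mismatched)} tensors differ, e.g. {mismatched[:5]}")
else:
    print("all tensors are identical")
```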
TL;DR: the weights are the same at the end of training from scratch and at the beginning of resuming.

Are the weights changing unexpectedly, or are the fluctuations solely due to […]? Given that we are saving the model during training on different triggered events, let's make sure the model's weights are the same for events where they should be the same.

From scratch
When resuming
Checkpoints
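For reference, per-event dumps like the ones listed above could be produced with a small Lightning `Callback` that writes the bare `state_dict` at each hook it overrides. This is only a sketch of the idea behind the `ce{epoch}.gs{step}.{hook}_wo` naming, not the code actually used here.

```python
import torch
import pytorch_lightning as pl


class DumpWeightsOnly(pl.Callback):
    """Illustrative sketch: write the model's state_dict at selected hook events
    so the dumps can be compared (sha1sum, torch.equal) across runs."""

    def _dump(self, trainer: pl.Trainer, pl_module: pl.LightningModule, hook: str) -> None:
        # Presumably mirrors the ce{current_epoch}.gs{global_step}.{hook}_wo pattern above.
        path = f"ce{trainer.current_epoch}.gs{trainer.global_step}.{hook}_wo"
        torch.save(pl_module.state_dict(), path)

    def on_fit_end(self, trainer, pl_module):
        self._dump(trainer, pl_module, "on_fit_end")

    def on_train_epoch_start(self, trainer, pl_module):
        self._dump(trainer, pl_module, "on_train_epoch_start")

    def on_train_epoch_end(self, trainer, pl_module):
        self._dump(trainer, pl_module, "on_train_epoch_end")

    def on_validation_end(self, trainer, pl_module):
        self._dump(trainer, pl_module, "on_validation_end")
```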
2024-06-06 lead
Notes
When I resume training, there's an initial spike in the loss that I don't expect:
Hypothesis