fix(trainer): load checkpoint without ckpt_path which doesn't allow for updated hyperparameters #46
Conversation
It seems to be fine-tuning correctly now. The only issue, as you mentioned, is that the logger reverts to epoch 0. The TensorBoard logs could be modified to add an offset to the epoch number if you want to visualize that, but I think it is more important to be able to resume training from the checkpoint.
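On the module side, the offset idea could look something like this minimal sketch. The `epoch_offset` hyperparameter, the class name, and the metric name are all hypothetical, not part of this PR:

```python
import pytorch_lightning as pl  # `lightning.pytorch` on newer releases


class FineTunableModule(pl.LightningModule):
    """Sketch only: logs an epoch counter shifted by the number of epochs
    completed in the original run, so TensorBoard curves line up."""

    def __init__(self, epoch_offset: int = 0):
        super().__init__()
        self.save_hyperparameters()

    def on_train_epoch_end(self):
        # Plots as a continuous epoch axis across the original run and
        # the fine-tuning run.
        self.log("epoch_adjusted",
                 float(self.current_epoch + self.hparams.epoch_offset))
```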
This looks like total overkill for what we want to do, but might be worthwhile to implement: https://lightning.ai/docs/pytorch/stable/notebooks/lightning_examples/finetuning-scheduler.html
Force-pushed from 203ddbb to 4d83b1c
Codecov Report
@@            Coverage Diff            @@
##           dev.cli      #46    +/-   ##
==========================================
- Coverage    55.02%   54.70%   -0.32%
==========================================
  Files           44       44
  Lines         2588     2603      +15
  Branches       347      350       +3
==========================================
  Hits          1424     1424
- Misses        1091     1106      +15
  Partials        73       73
Force-pushed from 4d83b1c to 4876a5e
OK, this is an updated patch that is ready for review and is now rebased onto #75 as well.
Force-pushed from 4876a5e to 955747b
The three cases to support are: 1) training from scratch, 2) resuming from a checkpoint without changes (preserving the epoch and current step), and 3) fine-tuning by changing values in the training configuration. A sketch of how these map onto the Lightning API follows.
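A minimal sketch of those three entry points, under assumptions: `MyModel`, `MyDataModule`, the module paths, and the `finetune` flag are hypothetical names, not identifiers from this repo.

```python
from typing import Optional

import pytorch_lightning as pl  # `lightning.pytorch` on newer releases

from my_project.model import MyModel        # hypothetical
from my_project.data import MyDataModule    # hypothetical


def run(hparams: dict, ckpt_path: Optional[str], finetune: bool) -> None:
    trainer = pl.Trainer(max_epochs=hparams.pop("max_epochs", 10))
    dm = MyDataModule()

    if ckpt_path is None:
        # 1) From scratch: fresh weights, fresh trainer state.
        model = MyModel(**hparams)
        trainer.fit(model, datamodule=dm)
    elif not finetune:
        # 2) Stateful resume: ckpt_path restores weights, optimizer state,
        #    and epoch/step counters, along with the *saved* hyperparameters.
        model = MyModel(**hparams)
        trainer.fit(model, datamodule=dm, ckpt_path=ckpt_path)
    else:
        # 3) Fine-tuning: load weights only; keyword arguments override
        #    hyperparameters stored via save_hyperparameters(). The trainer
        #    restarts at epoch 0, which is the logger quirk discussed above.
        model = MyModel.load_from_checkpoint(ckpt_path, **hparams)
        trainer.fit(model, datamodule=dm)
```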
Force-pushed from 955747b to b8d520b
Fixes #41
@davidguzmanr, can you try this PR out and see if it fixes the issue for you? It's a little annoying because it seems to lose all information about the current epoch, so the logger reverts to epoch 0, but it appears to have correctly loaded the weights and the updated hyperparameters. I find it a bit surprising that PyTorch Lightning doesn't have this figured out yet. I'm going to leave this as a draft (OK, drafts aren't possible on private repos, but please don't merge this) because I don't really want to lose the ability to statefully resume training, which this PR seems to do. But it will let @davidguzmanr run experiments with an updated LR until we figure out a better option.
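For what it's worth, the epoch information isn't gone from the checkpoint file itself: Lightning stores the trainer counters at the top level of the saved dict, so one workaround is to read them back as offsets. A hedged sketch; the `last.ckpt` path is illustrative:

```python
import torch

# Lightning checkpoints keep the trainer counters alongside the weights.
ckpt = torch.load("last.ckpt", map_location="cpu")
epoch_offset = ckpt["epoch"]        # epochs completed in the original run
step_offset = ckpt["global_step"]   # optimizer steps completed

print(f"resume offsets: epoch={epoch_offset}, step={step_offset}")
```

These values could feed something like the `epoch_offset` logging idea sketched earlier, so the TensorBoard curves stay continuous even though the trainer itself restarts at epoch 0.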