
fix(trainer): load checkpoint without ckpt_path which doesn't allow for updated hyperparameters #46

Merged · 1 commit merged into dev.cli on Sep 18, 2023

Conversation

@roedoejet (Member) commented on Jun 13, 2023

fixes #41

@davidguzmanr - can you try this PR out and see if it fixes the issue for you? It's a little annoying because it seems to lose all information about the current epoch, so the logger reverts to epoch 0, but it appears to have correctly loaded the weights and the updated hyperparameters. I find it a bit surprising that PyTorch Lightning doesn't have this figured out yet. I'm going to leave this as a draft (OK, drafts aren't possible on private repos, but please don't merge this) because I don't really want to lose the ability to statefully resume training, which this PR seems to do. But it will let @davidguzmanr run experiments with an updated LR until we figure out a better option.
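For reference, here is a minimal sketch (not the exact code in this PR) of the approach described above: load only the model weights from the checkpoint, then call `trainer.fit()` without `ckpt_path`, so the hyperparameters from the updated config take effect while the epoch/step counters restart at 0. `MyLightningModule`, `updated_config`, and the checkpoint path are placeholders.

```python
import torch

def load_weights_only(model, checkpoint_path: str):
    """Restore model weights from a Lightning checkpoint without restoring
    trainer state (epoch, global step, optimizer, LR scheduler)."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(ckpt["state_dict"])
    return model

# model = MyLightningModule(updated_config)      # hypothetical module/config
# model = load_weights_only(model, "last.ckpt")  # placeholder checkpoint path
# trainer.fit(model)  # no ckpt_path, so the logger restarts at epoch 0
```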

@davidguzmanr (Collaborator) commented

It seems to be fine-tuning correctly now. The only issue, as you mentioned, is that the logger reverts to epoch 0. I think the TensorBoard logs could be modified to add an offset to the epoch number if you want to visualize that, but being able to resume training from the checkpoint is more important.
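As a rough illustration of the epoch-offset idea (assumed names and values, not code from this repository), the logged curves could be shifted so the fine-tuning run continues from where the original run left off:

```python
from torch.utils.tensorboard import SummaryWriter

EPOCH_OFFSET = 100                    # hypothetical: epochs already trained in the checkpoint
training_losses = [0.52, 0.48, 0.45]  # placeholder values for illustration

writer = SummaryWriter(log_dir="lightning_logs/finetune")
for epoch, loss in enumerate(training_losses):
    # shift the x-axis so the fine-tuning curve lines up with the original run
    writer.add_scalar("loss/train", loss, global_step=epoch + EPOCH_OFFSET)
writer.close()
```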

@roedoejet (Member, Author) commented

This looks like total overkill for what we want to do, but might be worthwhile to implement: https://lightning.ai/docs/pytorch/stable/notebooks/lightning_examples/finetuning-scheduler.html

@codecov bot commented on Sep 11, 2023

Codecov Report

Merging #46 (b8d520b) into dev.cli (d30c847) will decrease coverage by 0.32%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           dev.cli      #46      +/-   ##
===========================================
- Coverage    55.02%   54.70%   -0.32%     
===========================================
  Files           44       44              
  Lines         2588     2603      +15     
  Branches       347      350       +3     
===========================================
  Hits          1424     1424              
- Misses        1091     1106      +15     
  Partials        73       73              
Files Changed                    Coverage        Δ
everyvoice/base_cli/helpers.py   0.00% <0.00%>   (ø)


@roedoejet (Member, Author) commented

OK, this is an updated patch that is ready for review and is now rebased onto #75 as well.

@davidguzmanr (Collaborator) left a comment


It seems it still doesn't update the training hyperparameters (blue line) and instead continues using the ones from the checkpoint (orange line). I think the scheduler, and maybe the optimizer states, also need to be updated.

[Figure "finetune": curves for 1) training from scratch, 2) resuming from a checkpoint without changes (preserves epoch and current step), and 3) fine-tuning by changing values in the training configuration]
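One possible way to address this (a hedged sketch, not code from this PR) would be to strip the saved optimizer and LR-scheduler states from the checkpoint so that fresh ones are built from the new training config. The key names below follow the standard PyTorch Lightning checkpoint layout; the file paths are placeholders.

```python
import torch

ckpt = torch.load("last.ckpt", map_location="cpu")
for key in ("optimizer_states", "lr_schedulers"):
    ckpt.pop(key, None)  # drop saved optimizer/scheduler state if present
torch.save(ckpt, "weights_only.ckpt")
# Resuming from "weights_only.ckpt" should then rebuild the optimizer and
# LR scheduler from the updated hyperparameters rather than the saved state.
```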
@roedoejet changed the base branch from main to dev.cli on September 18, 2023
@roedoejet merged commit 4b30f46 into dev.cli on Sep 18, 2023 (1 of 3 checks passed)
@roedoejet deleted the dev.finetune branch on September 18, 2023
Successfully merging this pull request may close these issues.

Bug with overwriting checkpoint training hyperparameters when finetuning