
fix(trainer): load checkpoint without ckpt_path which doesn't allow for updated hyperparameters #46

Merged · 1 commit merged into dev.cli on Sep 18, 2023

Conversation

@roedoejet (Member) commented on Jun 13, 2023

fixes #41

@davidguzmanr - can you try this PR out and see if it fixes the issue for you? It's a little annoying because it seems to lose all information about the current epoch, so the logger reverts to epoch 0, but it appears to have correctly loaded the weights and the updated hyperparameters. I find it a bit surprising that PyTorch Lightning doesn't have this figured out yet. I'm going to leave this as a draft (OK, drafts aren't possible on private repos, but please don't merge this) because I don't really want to lose the ability to statefully resume training, which this PR seems to do. But it will let @davidguzmanr run experiments with an updated LR until we figure out a better option.
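For reference, here is a minimal sketch (not the exact code in this PR) of the approach described above: load only the model weights from the checkpoint, then call `trainer.fit()` without `ckpt_path`, so the hyperparameters from the updated config take effect while the epoch/step counters restart at 0. `MyLightningModule`, `updated_config`, and the checkpoint path are placeholders.

```python
import torch

def load_weights_only(model, checkpoint_path: str):
    """Restore model weights from a Lightning checkpoint without restoring
    trainer state (epoch, global step, optimizer, LR scheduler)."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(ckpt["state_dict"])
    return model

# model = MyLightningModule(updated_config)      # hypothetical module/config
# model = load_weights_only(model, "last.ckpt")  # placeholder checkpoint path
# trainer.fit(model)  # no ckpt_path, so the logger restarts at epoch 0
```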

@davidguzmanr (Collaborator) commented

It seems to be fine-tuning correctly now. The only issue, as you mentioned, is that the logger reverts to epoch 0. I think the TensorBoard logs could be modified to add an offset to the epoch number if you want to visualize that, but being able to resume training from the checkpoint is more important.
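As a rough illustration of the epoch-offset idea (assumed names and values, not code from this repository), the logged curves could be shifted so the fine-tuning run continues from where the original run left off:

```python
from torch.utils.tensorboard import SummaryWriter

EPOCH_OFFSET = 100                    # hypothetical: epochs already trained in the checkpoint
training_losses = [0.52, 0.48, 0.45]  # placeholder values for illustration

writer = SummaryWriter(log_dir="lightning_logs/finetune")
for epoch, loss in enumerate(training_losses):
    # shift the x-axis so the fine-tuning curve lines up with the original run
    writer.add_scalar("loss/train", loss, global_step=epoch + EPOCH_OFFSET)
writer.close()
```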

@roedoejet (Member, Author) commented

This looks like total overkill for what we want to do, but might be worthwhile to implement: https://lightning.ai/docs/pytorch/stable/notebooks/lightning_examples/finetuning-scheduler.html

@codecov bot commented on Sep 11, 2023

Codecov Report

Merging #46 (b8d520b) into dev.cli (d30c847) will decrease coverage by 0.32%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           dev.cli      #46      +/-   ##
===========================================
- Coverage    55.02%   54.70%   -0.32%     
===========================================
  Files           44       44              
  Lines         2588     2603      +15     
  Branches       347      350       +3     
===========================================
  Hits          1424     1424              
- Misses        1091     1106      +15     
  Partials        73       73              
Files Changed                    Coverage        Δ
everyvoice/base_cli/helpers.py   0.00% <0.00%>   (ø)


@roedoejet (Member, Author) commented

OK, this is an updated patch that is ready for review and is now rebased onto #75 as well.

@davidguzmanr (Collaborator) left a comment


It seems it still doesn't update the training hyperparameters (blue line) and instead continues using the ones from the checkpoint (orange line). I think the scheduler, and maybe the optimizer states, also need to be updated.

[Figure "finetune": curves for 1) training from scratch, 2) resuming from a checkpoint without changes (preserves epoch and current step), and 3) fine-tuning by changing values in the training configuration]
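One possible way to address this (a hedged sketch, not code from this PR) would be to strip the saved optimizer and LR-scheduler states from the checkpoint so that fresh ones are built from the new training config. The key names below follow the standard PyTorch Lightning checkpoint layout; the file paths are placeholders.

```python
import torch

ckpt = torch.load("last.ckpt", map_location="cpu")
for key in ("optimizer_states", "lr_schedulers"):
    ckpt.pop(key, None)  # drop saved optimizer/scheduler state if present
torch.save(ckpt, "weights_only.ckpt")
# Resuming from "weights_only.ckpt" should then rebuild the optimizer and
# LR scheduler from the updated hyperparameters rather than the saved state.
```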
@roedoejet changed the base branch from main to dev.cli on September 18, 2023
@roedoejet merged commit 4b30f46 into dev.cli on Sep 18, 2023 (1 of 3 checks passed)
@roedoejet deleted the dev.finetune branch on September 18, 2023
Successfully merging this pull request may close these issues.

Bug with overwriting checkpoint training hyperparameters when finetuning