
Can't reload from checkpoint when using SWA #11665

Closed
ma-batita opened this issue Jan 31, 2022 · 18 comments · Fixed by #9938
Assignees: rohitgr7
Labels: bug (Something isn't working), callback: swa, priority: 1 (Medium priority task)

Comments


ma-batita commented Jan 31, 2022

🐛 Bug

My model worked just fine until I tried some optimisation using SWA.

from pytorch_lightning.callbacks import StochasticWeightAveraging

weighting = StochasticWeightAveraging()

The error itself is not easy to understand:

KeyError                                  Traceback (most recent call last)
<ipython-input-20-2d36fa4eaad0> in <module>()
     16 
     17 
---> 18 trainer.fit(module, data_module, ckpt_path="./checkpoints/best-checkpoint.ckpt")
     19 
     20 wandb.finish()

7 frames
/usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py in load_state_dict(self, state_dict)
    233         """
    234 
--> 235         lr_lambdas = state_dict.pop('lr_lambdas')
    236         self.__dict__.update(state_dict)
    237         # Restore state_dict keys in order to prevent side effects

KeyError: 'lr_lambdas'

To Reproduce

https://colab.research.google.com/github/PytorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report/bug_report_model.ipynb

Expected behavior

Resume training from a checkpoint with the SWA callback enabled.

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 11.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0+cu111
    • pytorch-lightning: 1.5.9
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.12
    • version: #1 SMP Tue Dec 7 09:58:10 PST 2021

cc @tchaton @rohitgr7 @akihironitta @carmocca

ma-batita added the bug (Something isn't working) label Jan 31, 2022
tchaton added the priority: 0 High priority task and priority: 1 Medium priority task labels Jan 31, 2022

myxik commented Feb 1, 2022

Hi! Can I take this issue?


rohitgr7 commented Feb 1, 2022

Hey @BttMA, can you update the reproducible Colab link? Currently it points to the one in the repo, which doesn't have any of your updated code.



rohitgr7 commented Feb 1, 2022

Hey @BttMA!
Can you share an actual failing script?

I tried updating your example with:

import os

from torch.utils.data import DataLoader

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import StochasticWeightAveraging

# BoringModel and RandomDataset are the helpers from the bug_report_model notebook linked above.

def run(max_epochs, ckpt_path=None):
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=max_epochs,
        enable_model_summary=False,
        callbacks=[StochasticWeightAveraging()],  # <-- the SWA callback
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data, ckpt_path=ckpt_path)
    return trainer

trainer = run(max_epochs=5, ckpt_path=None)
trainer.save_checkpoint('best_checkpoint.ckpt')
# resume from the checkpoint written automatically during the first run
ckpt_path = 'lightning_logs/version_0/checkpoints/epoch=4-step=4.ckpt'
trainer = run(max_epochs=20, ckpt_path=ckpt_path)

and it worked fine... so I guess I am unable to reproduce your issue.

ma-batita (Author) commented:

Sorry, I couldn't reproduce the bug with the boring model either. Is there any other way to share it?


rohitgr7 commented Feb 1, 2022

Share the notebook/script that is failing. You can attach it here too, in case someone else wants to look at it.


ma-batita commented Feb 1, 2022

It has personal/confidential data :/ so I can't share it with everybody.
Sorry, is there any other way to share it with you?


rohitgr7 commented Feb 1, 2022

Maybe you can mimic the data. For starters, return random tensors of the same shape as your original data and use a small model.
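
For instance, a minimal sketch of such a "mimic" dataset (the shapes and field names below are placeholders, not your actual data):

# A stand-in Dataset that returns random tensors with the same structure and
# shapes as the real samples, so the failure can be shared without the
# confidential data itself.
import torch
from torch.utils.data import Dataset

class RandomMimicDataset(Dataset):
    def __init__(self, num_samples=64, num_features=32, num_classes=4):
        self.num_samples = num_samples
        self.num_features = num_features
        self.num_classes = num_classes

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Same shapes as the original samples, random content
        features = torch.randn(self.num_features)
        label = torch.randint(0, self.num_classes, ()).long()
        return features, label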

ma-batita (Author) commented:

Hi @rohitgr7 :)
You can see the failure with this randomly generated data :) I hope you manage to solve it soon :)

https://colab.research.google.com/drive/1JHaHvQ5PhfaYil0HnIkEOcF1MoindYcK?usp=sharing#scrollTo=sNm4IkAefdkL&uniqifier=1

PS: maybe it has something to do with "UserWarning: SWA is currently only supported every epoch." or "Swapping scheduler LambdaLR for SWALR"? But those are just warnings and shouldn't cause loading to fail!


rohitgr7 commented Feb 2, 2022

Yes! That might be the case. We need to save and load the states for this callback to enable proper resuming.

ma-batita (Author) commented:

Which callback? Sorry, I didn't get you.
As you can see in the dummy code, when I get the "UserWarning: SWA is currently only supported every epoch." warning, the program skips it right away and continues to the next epoch. Also, in my real code I get the same warning only after 10 to 15 epochs! It doesn't make any sense.

Maybe SWA does not support many epochs? Even the doc is not very clear about how epochs interact with SWA. We have to dig deeper into this!


rohitgr7 commented Feb 2, 2022

I am talking about StochasticWeightAveraging.

The warning isn't very clear and should be improved. I only worked out what it means by looking at the code.

UserWarning: SWA is currently only supported every epoch.

Ideally a scheduler configured with interval='step' or frequency > 1 should work as configured, but SWA doesn't support that yet, which is what triggers the warning. That configuration is what you have done here:

return dict(lr_scheduler=dict(scheduler=scheduler, interval='step'),
            optimizer=optimizer)
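
For context, a minimal sketch of a configure_optimizers returning this kind of step-interval configuration (the module and hyperparameters below are placeholders, not your actual model):

import torch
import pytorch_lightning as pl

class TinyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3)
        # LambdaLR stepped after every optimizer step because of interval='step'
        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda step: 0.95 ** step)
        return dict(lr_scheduler=dict(scheduler=scheduler, interval='step'),
                    optimizer=optimizer)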

Also, in my real code I get the same warning only after 10 to 15 epochs! It doesn't make any sense.

Check out the default parameters: by default, SWA starts at epoch = 0.8 * max_epochs.
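
For illustration, the start point is controlled by swa_epoch_start (values below are just examples, using the callback as it exists in the Lightning version from this thread):

from pytorch_lightning.callbacks import StochasticWeightAveraging

# swa_epoch_start accepts a float (fraction of max_epochs) or an int (absolute epoch index);
# the default is 0.8, i.e. SWA only kicks in for the last 20% of training.
swa_late = StochasticWeightAveraging(swa_epoch_start=0.8)   # default: starts at 0.8 * max_epochs
swa_half = StochasticWeightAveraging(swa_epoch_start=0.5)   # starts halfway through training
swa_at_10 = StochasticWeightAveraging(swa_epoch_start=10)   # starts at epoch 10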

ma-batita (Author) commented:

Check out the default parameters: by default, SWA starts at epoch = 0.8 * max_epochs.

Now I see! For example, if I have 100 epochs then the SWA callback will be activated at the 40th epoch (since 40 = 0.8 * 100).

In my case, the SWA callback is just skipped because of the LambdaLR scheduler. It is never executed, and when I try to load a checkpoint things get messy and I get that error? Correct me if I am wrong, please.


rohitgr7 commented Feb 2, 2022

40th epoch (since 40=0.8*100)

80th epoch.

In your example, during the first run it switched to SWALR at the 40th epoch and saved the checkpoint at the 50th epoch with the SWALR state_dict. But when you reloaded the checkpoint, the trainer restored the scheduler states with LambdaLR configured, so effectively LambdaLR tried to load the state_dict of SWALR, which is what causes this error.
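
To see the mechanism in isolation (a standalone sketch, not Lightning code), loading a SWALR state_dict into a LambdaLR reproduces exactly the KeyError from the traceback above:

import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.optim.swa_utils import SWALR

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Scheduler that was active when the checkpoint was saved (after the SWA swap)
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
# Scheduler that configure_optimizers sets up again when resuming
lambda_scheduler = LambdaLR(optimizer, lr_lambda=lambda epoch: 0.95 ** epoch)

# SWALR's state_dict has no 'lr_lambdas' entry, so this raises KeyError: 'lr_lambdas'
lambda_scheduler.load_state_dict(swa_scheduler.state_dict())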

ma-batita (Author) commented:

80th epoch.

OH!! sure yes yes!!

In your example, during the first run it switched to SWALR at the 40th epoch and saved the checkpoint at the 50th epoch with the SWALR state_dict. But when you reloaded the checkpoint, the trainer restored the scheduler states with LambdaLR configured, so effectively LambdaLR tried to load the state_dict of SWALR, which is what causes this error.

Now it makes sense :) Thanks a lot!

Can you suggest anything for me to fix this, please?
Should I change the scheduler in my LightningModule from LambdaLR to SWALR? Something like this:

swa_scheduler = torch.optim.swa_utils.SWALR(optimizer, anneal_strategy="linear", anneal_epochs=5, swa_lr=0.05)


rohitgr7 commented Feb 2, 2022

Should I change the scheduler in my LightningModule from LambdaLR to SWALR? Something like this:

I'm not sure that will work. I'm not familiar with every detail of SWA, but I don't think replacing the scheduler is all that's required to perform SWA; there's a lot more happening inside the callback. For the fix, I think we need to create states for this callback that can be stored and reloaded from the checkpoint while resuming training. We need to investigate what is required to make this work.
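
Roughly the direction (just a sketch of the idea, not the actual fix; it assumes the generic Callback state hooks that newer Lightning versions expose):

import pytorch_lightning as pl

class StatefulSWASketch(pl.Callback):
    """Sketch: persist whatever SWA needs (averaged weights, SWALR state, ...)."""

    def __init__(self):
        self._average_model_state = None
        self._swa_scheduler_state = None

    def state_dict(self):
        # Whatever is returned here gets written into the Trainer checkpoint.
        return {"average_model": self._average_model_state,
                "swa_scheduler": self._swa_scheduler_state}

    def load_state_dict(self, state_dict):
        # Called when resuming from a checkpoint, so SWA can pick up where it left off.
        self._average_model_state = state_dict.get("average_model")
        self._swa_scheduler_state = state_dict.get("swa_scheduler")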

ma-batita (Author) commented:

For the fix, I think we need to create states for this callback that can be stored and reloaded from the checkpoint while resuming training.

Actually I was going to suggest that, but I don't know what held me back 😅
I will keep the issue open for further investigation (it would be helpful if you could mention other members).

thanks a lot!


carmocca commented Feb 4, 2022

For the fix, I think we need to create states for this callback that can be stored and reloaded from the checkpoint while resuming training.

This is correct. Saving and loading is not implemented.

Should I change the scheduler in my LightningModule from LambdaLR to SWALR?

This is done by the callback automatically.

rohitgr7 self-assigned this Feb 4, 2022
Borda removed the priority: 0 High priority task label Aug 8, 2022
rohitgr7 linked a pull request Aug 8, 2022 that will close this issue