trainer.test(datamodule=dm) stores reference to wrong checkpoint #4693

Closed
ananyahjha93 opened this issue Nov 16, 2020 · 2 comments · Fixed by Lightning-Universe/lightning-bolts#371
Labels
bug Something isn't working help wanted Open to be worked on

@ananyahjha93
Contributor

ananyahjha93 commented Nov 16, 2020

🐛 Bug

When fine-tuning from saved weights in bolts, trainer.test() picks up a reference to a checkpoint that has either already been deleted or has not been created yet.
The checkpoints were created with the default trainer options; no callbacks were added on the user's side.

Please reproduce using [the BoringModel and post here]

Not sure how to reproduce fine-tuning from a checkpoint using the boring model.

To Reproduce

  1. clone bolts using git clone https://github.com/PyTorchLightning/pytorch-lightning-bolts.git
  2. cd to pl_bolts/models/self_supervised/swav/
  3. wget 'https://pl-bolts-weights.s3.us-east-2.amazonaws.com/swav/checkpoints/swav_stl10.pth.tar'
  4. python swav_finetuner.py --ckpt swav_stl10.pth.tar --dataset stl10 --batch_size 256 --gpus 1 --learning_rate 0.1

The latest saved checkpoint is, say, 'epoch=33.ckpt', but line 712 in trainer.py looks for a different checkpoint, which may belong to an epoch before or after the one actually present in the checkpoints folder.

  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 712, in test
    results = self.__test_using_best_weights(ckpt_path, test_dataloaders)

Error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/pytorch_lightning_bolts/pl_bolts/models/self_supervised/swav/lightning_logs/version_3/checkpoints/epoch=7.ckpt'
FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/pytorch_lightning_bolts/pl_bolts/models/self_supervised/swav/lightning_logs/version_3/checkpoints/epoch=21.ckpt'
FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/pytorch_lightning_bolts/pl_bolts/models/self_supervised/swav/lightning_logs/version_3/checkpoints/epoch=37.ckpt'
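
To confirm the mismatch, the requested path from the error can be compared with what is actually on disk. The following is a minimal diagnostic sketch, not part of Lightning or bolts: `available_epochs` and `describe_mismatch` are hypothetical helper names, and the sketch assumes checkpoints follow the `epoch=N.ckpt` naming seen in the errors above.

```python
import re
from pathlib import Path


def available_epochs(ckpt_dir):
    """Return the sorted epoch numbers of the `epoch=N.ckpt` files in ckpt_dir."""
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return []
    epochs = []
    for path in directory.iterdir():
        # Assumption: default checkpoint names look like `epoch=33.ckpt`.
        match = re.fullmatch(r"epoch=(\d+)\.ckpt", path.name)
        if match:
            epochs.append(int(match.group(1)))
    return sorted(epochs)


def describe_mismatch(requested_path):
    """Compare the checkpoint path from the error with the folder contents."""
    requested = Path(requested_path)
    return {
        "requested": requested.name,
        "exists": requested.is_file(),
        "available_epochs": available_epochs(requested.parent),
    }
```

Calling `describe_mismatch` with the path copied from the `FileNotFoundError` shows whether the file exists and which epochs are present in the same folder.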

Expected behavior

trainer.test(datamodule=dm) should pick up the reference to the correct checkpoint saved in lightning_logs/version_x/checkpoints.
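
Until the reference is corrected, one possible workaround is to resolve the checkpoint path explicitly before testing. This is only a sketch under stated assumptions: `resolve_test_checkpoint` is a hypothetical helper, not a Lightning API, and it falls back to the most recently modified `.ckpt` file in the folder, which may not be the best-scoring one.

```python
from pathlib import Path


def resolve_test_checkpoint(best_model_path, ckpt_dir):
    """Return best_model_path if it exists on disk, otherwise the newest
    *.ckpt file in ckpt_dir, otherwise None."""
    if best_model_path and Path(best_model_path).is_file():
        return str(best_model_path)
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return None
    # Fallback assumption: the most recently written file is the one to test.
    candidates = sorted(directory.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return str(candidates[-1]) if candidates else None


# Hypothetical usage, not verified against swav_finetuner.py:
# ckpt = resolve_test_checkpoint(
#     trainer.checkpoint_callback.best_model_path,
#     "lightning_logs/version_x/checkpoints",
# )
# trainer.test(datamodule=dm, ckpt_path=ckpt)
```

Passing the resolved path through `ckpt_path` avoids relying on the stored reference; if the helper returns `None`, the caller still has to decide whether testing the in-memory weights is acceptable.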

Environment

PyTorch Lightning version 1.0.4+ (tested with both 1.0.4 and 1.0.6)
bolts from master

  • PyTorch Version (e.g., 1.0): 1.6
  • OS (e.g., Linux): linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version:
  • GPU models and configuration: V100s
  • Any other relevant information:

Additional context

@ananyahjha93 ananyahjha93 added bug Something isn't working help wanted Open to be worked on priority: 0 High priority task labels Nov 16, 2020
@SeanNaren SeanNaren self-assigned this Nov 16, 2020
@Borda
Member

Borda commented Nov 16, 2020

Could you please say how the checkpoint was created, and which PL version was used?

@ananyahjha93
Contributor Author

@Borda added
