Bug description

The `Trainer.fit` documentation states that training will only be resumed if `ckpt_path` is provided, and `OnExceptionCheckpoint` states that its purpose is to save a checkpoint on exception. However, if an `OnExceptionCheckpoint` is included in the trainer's list of callbacks, then even when `Trainer.fit` is called without a `ckpt_path` argument, the `CheckpointConnector` will search for a checkpoint in `OnExceptionCheckpoint.dirpath`, and if one is found it will be used to resume training.
Further, the warnings shown at the beginning of `Trainer.fit` when the `OnExceptionCheckpoint` callback is enabled are, in my opinion, incorrect (see logs below).
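The surprising discovery step can be illustrated with a minimal, self-contained sketch. Note this is a hypothetical stand-in that only mirrors the reported behavior, not Lightning's actual `CheckpointConnector` internals; the function name `find_resume_ckpt` and the checkpoint filename are assumptions:

```python
import os
import tempfile

def find_resume_ckpt(dirpath, filename="on_exception.ckpt"):
    """Hypothetical sketch of the discovery this issue describes: a checkpoint
    left in OnExceptionCheckpoint.dirpath is picked up by a later `fit` call
    even though no ckpt_path was requested."""
    candidate = os.path.join(dirpath, filename)
    return candidate if os.path.isfile(candidate) else None

with tempfile.TemporaryDirectory() as tmpdir:
    # Fresh run: nothing to resume from.
    assert find_resume_ckpt(tmpdir) is None
    # A leftover exception checkpoint from a previous run...
    stale = os.path.join(tmpdir, "on_exception.ckpt")
    open(stale, "w").close()
    # ...is silently discovered and would be used to resume training.
    assert find_resume_ckpt(tmpdir) == stale
```

The point of contention is exactly this last step: the user never asked for a resume, but the mere presence of the file changes the behavior of the second `fit` call.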
What version are you seeing the problem on?
v2.2
How to reproduce the bug
```python
from lightning.pytorch.callbacks import OnExceptionCheckpoint, ModelCheckpoint
from lightning.pytorch import Trainer
from lightning.pytorch.demos.boring_classes import BoringModel

trainer = Trainer(
    # NOTE: either of these callback lists reproduces the issue (and the same warnings mentioned below)
    # callbacks=[OnExceptionCheckpoint("."), ModelCheckpoint(".", save_last=True)],
    callbacks=[OnExceptionCheckpoint("."), ModelCheckpoint(".", save_last=False)],
    max_epochs=3,
)
trainer.fit(model=BoringModel())
# calling `fit` again will resume training from the checkpoint discovered in cwd:
trainer.fit(model=BoringModel())
```
Error messages and logs
The following warnings are confusing and arguably wrong:
```
.../checkpoint_connector.py:126: `.fit(ckpt_path=None)` was called without a model. The last model of the previous `fit` call will be used. You can pass `fit(ckpt_path='best')` to use the best model or `fit(ckpt_path='last')` to use the last model. If you pass a value, this warning will be silenced.
.../checkpoint_connector.py:186: .fit(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded. HINT: Set `ModelCheckpoint(..., save_last=True)`.
```
### Environment
PyTorch Lightning Version: 2.2.1
### More info
_No response_