🚀 Feature
Currently, if neither max_epochs nor max_steps is set, Lightning defaults to max_epochs = 1000. However, in some situations the user doesn't want any limit on epochs or steps at all, and automatically stopping after 1000 epochs is unwanted. It would be great if there were a way to specify that no epoch limit should be applied.
Motivation
I was running a very large training job over the weekend on a very large dataset. Because the dataset is so large, I set the number of batches per epoch to be small (relative to the size of the dataset) so that logging would occur more frequently. I set max_epochs and max_steps to None because I believed this would disable automatic stopping. When I checked on the model again on Monday, it had exited early after only a couple of hours.
Pitch
Provide a way to specify that automatic stopping should be disabled. Having to hard-code a huge number like 2**1000 isn't elegant (especially compared to how elegant the rest of Lightning is).
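As a sketch of what this could look like internally, a sentinel value such as -1 or None could route the epoch loop to an unbounded iterator instead of substituting a 1000-epoch default. The function name and sentinel handling here are hypothetical, not Lightning's actual API:

```python
from itertools import count

# Hypothetical sketch: treat max_epochs=-1 or None as "train indefinitely"
# rather than silently substituting a default of 1000 epochs.
def epoch_iterator(current_epoch, max_epochs):
    if max_epochs is None or max_epochs == -1:
        return count(current_epoch)  # infinite iterator, no automatic stop
    return range(current_epoch, max_epochs)
```

With a sentinel like this, the user never has to invent an arbitrarily large integer, and the stopping behavior is explicit in the Trainer arguments.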
Alternatives
The user could pass float("inf") as max_epochs to disable stopping if not using multiple GPUs. However, when using multiple GPUs they would need to use a very large integer instead (e.g. 2**1000).
Using a float for max_epochs produces the following error:
Training: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/home/eric/.../train.runfiles/train.py", line 371, in <module>
main(sys.argv[1:])
File "/home/eric/.../train.runfiles/train.py", line 362, in main
trainer.fit(model, data_module)
File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 756, in _run
self.dispatch()
File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
self.accelerator.start_training(self)
File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
return self.run_train()
File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 861, in run_train
epochs = range(self.current_epoch, self.max_epochs) if self.max_epochs else count(self.current_epoch)
TypeError: 'float' object cannot be interpreted as an integer
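The TypeError comes from the last frame above: float("inf") is truthy, so the range branch is taken, and range() only accepts integers. A minimal stdlib reproduction of that line (function name is mine, the logic mirrors the traceback):

```python
from itertools import count

# Mirrors the line from the traceback:
#   epochs = range(self.current_epoch, self.max_epochs) if self.max_epochs
#            else count(self.current_epoch)
def make_epoch_iter(current_epoch, max_epochs):
    return range(current_epoch, max_epochs) if max_epochs else count(current_epoch)

# A large integer works, since Python ints have arbitrary precision:
next(iter(make_epoch_iter(0, 2**1000)))  # -> 0

# float("inf") is truthy, so range() is called and rejects the float:
try:
    make_epoch_iter(0, float("inf"))
except TypeError as exc:
    print(exc)  # 'float' object cannot be interpreted as an integer
```

This is why the float("inf") workaround only helps on code paths that compare against max_epochs rather than building a range from it, and why a huge integer is the only value that works everywhere.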
Additional context
https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1627425271133200