
Allow disabling automatic stopping during fitting #8818

@EricWiener

Description


🚀 Feature

Currently, if neither max_epochs nor max_steps is set, Lightning defaults to max_epochs = 1000. However, in some situations the user doesn't want any limit on epochs or steps at all, and automatically stopping after 1000 epochs is unwanted. It would be great if there were a way to specify that no maximum should be applied.

Motivation

I was running a very large training job over the weekend on a very large dataset. Because the dataset is so large, I set the number of batches per epoch to be very small (relative to the size of the dataset) so that logging would occur more frequently. I set max_epochs and max_steps to None because I believed this would disable automatic stopping, but when I checked on the model again on Monday, it had exited early after only a couple of hours.

Pitch

Have some way to specify that automatic stopping should be disabled. Having to hard-code a large number like 2**1000 isn't elegant, especially compared to the rest of Lightning's API.

Alternatives

The user can pass float("inf") as max_epochs to disable stopping when not using multiple GPUs. However, when using multiple GPUs, they would need to use a very large integer instead (e.g. 2**1000).
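As a sketch of why the large-integer workaround behaves differently from float("inf") (this is plain Python, not a supported Lightning pattern): Python ints are arbitrary-precision and are valid range() bounds, while floats are not.

```python
# A huge Python int is a legal range() endpoint, so passing it as
# max_epochs effectively disables the epoch limit.
huge = 2**1000

epochs = range(0, huge)           # fine: int endpoints
assert next(iter(epochs)) == 0

# float("inf") is not a legal endpoint and raises TypeError:
try:
    range(0, float("inf"))
except TypeError as err:
    assert "float" in str(err)
```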

If using a float for max_epochs, you will get the following error:

Training: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/home/eric/.../train.runfiles/train.py", line 371, in <module>
    main(sys.argv[1:])
  File "/home/eric/.../train.runfiles/train.py", line 362, in main
    trainer.fit(model, data_module)
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 861, in run_train
    epochs = range(self.current_epoch, self.max_epochs) if self.max_epochs else count(self.current_epoch)
TypeError: 'float' object cannot be interpreted as an integer
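The failing line can be reproduced in isolation. A minimal sketch (the helper name here is hypothetical; it mirrors the expression in run_train shown in the traceback):

```python
from itertools import count

def epoch_iter(current_epoch, max_epochs):
    # Mirrors the failing trainer line: any truthy max_epochs is handed
    # straight to range(), which only accepts integers, while a falsy
    # value falls through to an unbounded count().
    return range(current_epoch, max_epochs) if max_epochs else count(current_epoch)

assert list(epoch_iter(0, 3)) == [0, 1, 2]   # an int works

try:
    epoch_iter(0, float("inf"))              # inf is truthy, so it reaches range()
except TypeError:
    pass                                     # 'float' object cannot be interpreted as an integer
```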

Additional context

https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1627425271133200


Labels: feature (is an improvement or enhancement), good first issue (good for newcomers), help wanted (open to be worked on)
