
Warn when running an infinite epoch and overriding "epoch end" accumulating hooks #11554

Closed · carmocca opened this issue Jan 20, 2022 · 7 comments · Fixed by #16520
Labels: feature (Is an improvement or enhancement) · good first issue (Good for newcomers) · hooks (Related to the hooks API) · trainer: argument

carmocca (Contributor) commented Jan 20, 2022

🚀 Feature

When the user configures Trainer(max_steps=-1, max_epochs=-1), an endless epoch runs. Overriding training_epoch_end, or validation_epoch_end together with val_check_interval set to a float, then becomes a problem: these hooks keep all step outputs in memory indefinitely.
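
For concreteness, a minimal sketch of the setup in question, using the pre-2.0 `training_epoch_end` hook mentioned above (the module body is illustrative only):

```python
import pytorch_lightning as pl


class AccumulatingModel(pl.LightningModule):
    # Overriding this hook makes the training loop collect every step's
    # outputs so it can pass them here at the end of the epoch...
    def training_epoch_end(self, outputs):
        ...


# ...but with an endless epoch, the "end of the epoch" never comes, so the
# collected outputs grow in memory for as long as training runs.
trainer = pl.Trainer(max_steps=-1, max_epochs=-1)
```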

Motivation

Many users are not aware of the impact of overriding these hooks, so infinite epochs open the door to "memory leaks".

Pitch

Raise a warning in this case informing the user of this behaviour.
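
A rough sketch of what such a check could look like. The helper name and warning text below are hypothetical; only the `is_overridden` and `rank_zero_warn` utilities are existing Lightning helpers:

```python
from pytorch_lightning.utilities import rank_zero_warn
from pytorch_lightning.utilities.model_helpers import is_overridden


def _warn_if_infinite_epoch_accumulates_outputs(trainer, model):
    # Hypothetical helper: the real check would live in the Trainer's
    # configuration validator, alongside the existing hook checks.
    infinite_epoch = trainer.max_epochs == -1 and trainer.max_steps == -1
    if infinite_epoch and is_overridden("training_epoch_end", model):
        rank_zero_warn(
            "You are running an infinite epoch and have overridden"
            " `training_epoch_end`, which will accumulate all step outputs"
            " in memory indefinitely."
        )
```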

Additional context

Proposed in #11480 (comment)


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.

cc @Borda @carmocca @awaelchli @ninginthecloud @daniellepintz @rohitgr7 @justusschock @kaushikb11

@carmocca carmocca added feature Is an improvement or enhancement hooks Related to the hooks API trainer: argument good first issue Good for newcomers labels Jan 20, 2022
@carmocca carmocca changed the title Show warning for infinitely running trainers that override epoch end hooks. Warn when running an infinite epoch and overriding "epoch end" accumulating hooks. Jan 20, 2022
@carmocca carmocca changed the title Warn when running an infinite epoch and overriding "epoch end" accumulating hooks. Warn when running an infinite epoch and overriding "epoch end" accumulating hooks Jan 20, 2022
vedpatwardhan commented

I'd like to work on this, could you assign this issue to me?

tchaton (Contributor) commented Jan 21, 2022

Hey @vedpatwardhan,

Yes, go on!

akihironitta (Contributor) commented Jan 21, 2022

@carmocca I was thinking it also applies to the case where only max_steps=-1 is specified, because that means an endless epoch, right?


Update: Well, I wasn't correct at all. If one specifies only max_steps=-1, max_epochs=1000 will be used (and thus it's not endless training), so there's no need to add the warning in that case.

That said, the 1.5 release note was quite confusing, because with Trainer(max_steps=-1) it's not really an endless epoch but a complete one. https://github.com/PyTorchLightning/pytorch-lightning/releases/tag/1.5.0

Infinite training is now supported by setting Trainer(max_epochs=-1) for an unlimited number of epochs, or Trainer(max_steps=-1) for an endless epoch.

Note: you will want to avoid logging with on_epoch=True in case of max_steps=-1.
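
To illustrate the distinction drawn in this comment between the two configurations (a sketch based on the defaults described above):

```python
from pytorch_lightning import Trainer

# Truly infinite training: neither the number of epochs nor steps is bounded.
infinite = Trainer(max_steps=-1, max_epochs=-1)

# Only max_steps=-1: max_epochs falls back to its default of 1000, so this
# still stops after 1000 (finite) epochs and does not need the warning.
bounded = Trainer(max_steps=-1)
```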

vedpatwardhan commented

I had a doubt while writing a test for this change. In the test_config_validator.py file, inside the test_fit_val_loop_config function, the BoringModel class is instantiated when a warning is expected. Could you explain what that class is for, and whether I also need to instantiate BoringModel for the new warning? Also, what other properties of the model do I need to change before calling trainer.fit(model)?
Thanks in advance.

ananthsub (Contributor) commented

Here's one suggestion for testing this:

  • Create a new class that extends BoringModel and implements training_epoch_end.
  • Construct a Trainer configured for infinite training.
  • Call trainer.fit and confirm that a warning is emitted from the configuration validator.

You can follow the same pattern for the validation_epoch_end and val_check_interval==float scenario described in the issue.
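
A sketch of such a test, assuming the BoringModel import path used by the test helpers and that the emitted warning mentions the overridden hook (both are assumptions); the stop-callback is only there so the otherwise infinite run terminates:

```python
import pytest

from pytorch_lightning import Callback, Trainer
from tests.helpers import BoringModel  # assumed test-helper import path


class EpochEndModel(BoringModel):
    # Overriding this hook is what should trigger the new warning.
    def training_epoch_end(self, outputs):
        pass


class StopAfterOneBatch(Callback):
    # Infinite training never ends on its own, so stop it after one batch.
    def on_train_batch_end(self, trainer, *args, **kwargs):
        trainer.should_stop = True


def test_warns_on_infinite_epoch_with_epoch_end_hook(tmpdir):
    model = EpochEndModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        max_steps=-1,
        max_epochs=-1,
        limit_train_batches=2,
        limit_val_batches=0,
        callbacks=StopAfterOneBatch(),
    )
    # The exact warning category and text are hypothetical; match against
    # whatever the configuration validator ends up emitting.
    with pytest.warns(UserWarning, match="training_epoch_end"):
        trainer.fit(model)
```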

vedpatwardhan commented

Okay, I'll do that.

vedpatwardhan commented

@ananthsub could you please help me understand what's going wrong? I think I've made the change, but many of the tests are failing.
