
Warn when running an infinite epoch and overriding "epoch end" accumulating hooks #11554

Closed · carmocca opened this issue Jan 20, 2022 · 7 comments · Fixed by #16520
Labels: feature (Is an improvement or enhancement) · good first issue (Good for newcomers) · hooks (Related to the hooks API) · trainer: argument

carmocca (Contributor) commented Jan 20, 2022

🚀 Feature

When the user configures Trainer(max_steps=-1, max_epochs=-1), an endless epoch runs. Overriding training_epoch_end, or validation_epoch_end together with val_check_interval set to a float, then becomes a problem: these hooks keep all step outputs in memory indefinitely.
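
For concreteness, a minimal sketch of the setup in question, using the pre-2.0 `training_epoch_end` hook mentioned above (the module body is illustrative only):

```python
import pytorch_lightning as pl


class AccumulatingModel(pl.LightningModule):
    # Overriding this hook makes the training loop collect every step's
    # outputs so it can pass them here at the end of the epoch...
    def training_epoch_end(self, outputs):
        ...


# ...but with an endless epoch, the "end of the epoch" never comes, so the
# collected outputs grow in memory for as long as training runs.
trainer = pl.Trainer(max_steps=-1, max_epochs=-1)
```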

Motivation

Many users are not aware of the impact of overriding these hooks, so infinite epochs open the door to "memory leaks".

Pitch

Raise a warning in this case informing the user of this behaviour.
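
A rough sketch of what such a check could look like. The helper name and warning text below are hypothetical; only the `is_overridden` and `rank_zero_warn` utilities are existing Lightning helpers:

```python
from pytorch_lightning.utilities import rank_zero_warn
from pytorch_lightning.utilities.model_helpers import is_overridden


def _warn_if_infinite_epoch_accumulates_outputs(trainer, model):
    # Hypothetical helper: the real check would live in the Trainer's
    # configuration validator, alongside the existing hook checks.
    infinite_epoch = trainer.max_epochs == -1 and trainer.max_steps == -1
    if infinite_epoch and is_overridden("training_epoch_end", model):
        rank_zero_warn(
            "You are running an infinite epoch and have overridden"
            " `training_epoch_end`, which will accumulate all step outputs"
            " in memory indefinitely."
        )
```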

Additional context

Proposed in #11480 (comment)


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.

cc @Borda @carmocca @awaelchli @ninginthecloud @daniellepintz @rohitgr7 @justusschock @kaushikb11

@carmocca carmocca added feature Is an improvement or enhancement hooks Related to the hooks API trainer: argument good first issue Good for newcomers labels Jan 20, 2022
@carmocca carmocca changed the title Show warning for infinitely running trainers that override epoch end hooks. Warn when running an infinite epoch and overriding "epoch end" accumulating hooks. Jan 20, 2022
@carmocca carmocca changed the title Warn when running an infinite epoch and overriding "epoch end" accumulating hooks. Warn when running an infinite epoch and overriding "epoch end" accumulating hooks Jan 20, 2022
vedpatwardhan commented

I'd like to work on this, could you assign this issue to me?

tchaton (Contributor) commented Jan 21, 2022

Hey @vedpatwardhan,

Yes, go on!

akihironitta (Contributor) commented Jan 21, 2022

@carmocca I was thinking it also applies to the case where only max_steps=-1 is specified, because that means an endless epoch, right?


Update: Well, I wasn't correct at all. If one specifies only max_steps=-1, max_epochs=1000 will be used (and thus it's not endless training), so there's no need to add the warning in that case.

That said, the 1.5 release note was quite confusing, because with Trainer(max_steps=-1) it's not really an endless epoch but a complete one. https://github.com/PyTorchLightning/pytorch-lightning/releases/tag/1.5.0

Infinite training is now supported by setting Trainer(max_epochs=-1) for an unlimited number of epochs, or Trainer(max_steps=-1) for an endless epoch.

Note: you will want to avoid logging with on_epoch=True in case of max_steps=-1.
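
To illustrate the distinction drawn in this comment between the two configurations (a sketch based on the defaults described above):

```python
from pytorch_lightning import Trainer

# Truly infinite training: neither the number of epochs nor steps is bounded.
infinite = Trainer(max_steps=-1, max_epochs=-1)

# Only max_steps=-1: max_epochs falls back to its default of 1000, so this
# still stops after 1000 (finite) epochs and does not need the warning.
bounded = Trainer(max_steps=-1)
```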

vedpatwardhan commented

I had a doubt while writing a test for this change. In the test_config_validator.py file, inside the test_fit_val_loop_config function, the BoringModel class is instantiated when a warning is expected. Could you explain what that class is for, and whether I also need to instantiate BoringModel for the new warning? Also, what other properties of the model do I need to change before calling trainer.fit(model)?
Thanks in advance.

ananthsub (Contributor) commented

Here's one suggestion for testing this:

  • Create a new class that extends BoringModel and implements training_epoch_end.
  • Construct a Trainer configured for infinite training.
  • Call trainer.fit and confirm that a warning is emitted from the configuration validator.

You can follow the same pattern for the validation_epoch_end and val_check_interval==float scenario described in the issue.
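
A sketch of such a test, assuming the BoringModel import path used by the test helpers and that the emitted warning mentions the overridden hook (both are assumptions); the stop-callback is only there so the otherwise infinite run terminates:

```python
import pytest

from pytorch_lightning import Callback, Trainer
from tests.helpers import BoringModel  # assumed test-helper import path


class EpochEndModel(BoringModel):
    # Overriding this hook is what should trigger the new warning.
    def training_epoch_end(self, outputs):
        pass


class StopAfterOneBatch(Callback):
    # Infinite training never ends on its own, so stop it after one batch.
    def on_train_batch_end(self, trainer, *args, **kwargs):
        trainer.should_stop = True


def test_warns_on_infinite_epoch_with_epoch_end_hook(tmpdir):
    model = EpochEndModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        max_steps=-1,
        max_epochs=-1,
        limit_train_batches=2,
        limit_val_batches=0,
        callbacks=StopAfterOneBatch(),
    )
    # The exact warning category and text are hypothetical; match against
    # whatever the configuration validator ends up emitting.
    with pytest.warns(UserWarning, match="training_epoch_end"):
        trainer.fit(model)
```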

vedpatwardhan commented

Okay, I'll do that.

vedpatwardhan commented

@ananthsub could you please help me understand what's going wrong? I think I've made the change, but many of the tests are failing.
