
[RFC] Deprecate the _epoch_end hooks #8731

Closed
ananthsub opened this issue Aug 5, 2021 · 12 comments · Fixed by #16520
Labels: deprecation, design, discussion

@ananthsub
Contributor

ananthsub commented Aug 5, 2021

We are auditing the Lightning components and APIs to assess opportunities for improvements:

#7740
https://docs.google.com/document/d/1xHU7-iQSpp9KJTjI3As2EM0mfNHHr37WZYpDpwLkivA/edit#

Lightning has had some recent issues filed around these hooks:

  • training_epoch_end
  • validation_epoch_end
  • test_epoch_end
  • predict_epoch_end

Examples:

These hooks exist to accumulate step-level outputs during the epoch for post-processing at the end of the epoch. However, they do not need to be part of the core LightningModule interface: users can easily track outputs directly inside their own modules.

Asking users to do this tracking offers major benefits:

  1. We avoid API confusion: for instance, when should users implement something in training_epoch_end vs. on_train_epoch_end? This can improve the onboarding experience (one less class of hooks to learn about, and only one way to do things).
  2. This can also improve performance: if users implement training_epoch_end and don't use the outputs, the trainer needlessly accumulates results, which wastes memory and risks OOMing. This slowdown is not clearly visible to the user either, unless training fails outright, at which point it is a bad user experience.
  3. Reduced API surface area for the trainer reduces the risk of bugs like this. These bugs disproportionately hurt user trust because the control flow isn't visible to end users. Conversely, removing this class of bugs has a disproportionate benefit to user trust.
  4. The current contract makes the trainer responsible for stewardship of data it doesn't directly use. Removing this support clarifies responsibilities and simplifies the loop internals.
  5. There's less "framework magic" at play, which means more readable user code because the tracking is explicit.
  6. Because the tracking is explicit, the responsibility of testing also falls to users; in general, we should encourage users to test their own code and keep the framework itself easily testable.

Cons:

  1. (marginally) more boilerplate code in LightningModules. For instance, users would need to pay attention to resetting the accumulated outputs (unless they explicitly want to accumulate results across epochs).

Proposal

  • Deprecate training_epoch_end, validation_epoch_end, and test_epoch_end in v1.5
  • Remove these hooks entirely, and their corresponding calls in the loops in v1.7

This is how easily users can implement this in their LightningModule with the existing hooks:

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self._train_outputs = []  # <----- New

    def training_step(self, *args, **kwargs):
        ...
        output = ...
        self._train_outputs.append(output)  # <----- New
        return output

    def on_train_epoch_end(self) -> None:
        # process self._train_outputs
        self._train_outputs = []  # <----- New

So we're talking about 3 lines of code per train/val/test/predict stage. I'd argue this is minimal compared to the amount of logic that usually goes into post-processing the outputs anyway.

@PyTorchLightning/core-contributors

Originally posted by @ananthsub in #8690

@ananthsub ananthsub added the design label Aug 5, 2021
@ananthsub
Contributor Author

One thing that will emerge is the hook call order between callbacks and the LightningModule: logging something in the module's on_train_epoch_end needs to happen first for it to be usable in the callback's on_train_epoch_end hook. This would be a change compared to today's behavior, and it is required for cases like logging a value at the end of the epoch in the module and using that key/value as a monitor for callbacks like model checkpointing or early stopping.

So this will be related to #8506
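
To make the ordering concern concrete, here is a rough sketch (not from the original discussion) of the scenario described above: the module logs an epoch-level value in its own on_train_epoch_end, and a ModelCheckpoint callback monitors that key. The metric name and the _compute_epoch_score helper are made up for illustration.

from pytorch_lightning import LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint

class MyModel(LightningModule):
    def on_train_epoch_end(self) -> None:
        # hypothetical helper; any epoch-level aggregation would work here
        epoch_score = self._compute_epoch_score()
        # the module's hook must run before the callbacks' hooks so that
        # this value is visible to the monitor below
        self.log("train_score", epoch_score)

# the callback monitors the key logged by the module at epoch end
checkpoint_callback = ModelCheckpoint(monitor="train_score", mode="max")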

@carmocca
Contributor

carmocca commented Aug 5, 2021

Even though I agree with your arguments, people just love "framework magic". Removing this can make sense engineering-wise and design-wise but it can break user trust.

This is how easily users can implement this in their LightningModule with the existing hooks:

This does look simple, but it does not consider the cases where multiple optimizers are available. Basically, all the code we would remove would need to be implemented by the users who need it, depending on how complex their training step is.

If we were in a 0.x version I'd entirely agree with you, but we have to consider how widely used these hooks are and the degree of headaches this change might impose on users.


Have you considered making output aggregation opt-in, requiring a LightningModule flag (similar to automatic_optimization)?

If outputs=True, then the use of training_epoch_end is the same as before. If outputs=False, then using this hook is a way of dealing with the callback-model order.
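
A purely hypothetical sketch of what that flag could look like; no such attribute exists in Lightning, and the name and semantics here only illustrate the suggestion above.

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        # hypothetical opt-in flag, by analogy with self.automatic_optimization;
        # the trainer would only accumulate step outputs when this is True
        self.outputs = True

    def training_epoch_end(self, outputs):
        # would receive the accumulated step outputs only when self.outputs is True
        ...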

@ananthsub
Contributor Author

This does look simple but does not consider the cases where multiple optimizers are available. Basically, all the code we would remove would need to be implemented by the users who need it, depending on how complex is their training step.

I think that's also the point. If we want to support more flavors of training steps, the logic for handling these outputs in the framework gets more and more complex, when it could sit locally in the user's training step.

Tracking the outputs per optimizer index and dataloader index in automatic optimization also shouldn't be significantly more work, given these are accessible from the training step.
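
As a sketch of what that might look like (not from the issue), a module with multiple optimizers under automatic optimization could key its tracked outputs by the optimizer_idx it already receives in training_step:

from collections import defaultdict

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        # one list of step outputs per optimizer
        self._train_outputs = defaultdict(list)

    def training_step(self, batch, batch_idx, optimizer_idx):
        loss = ...  # compute the loss for this optimizer
        # detach so no computation graphs are kept alive across the epoch
        self._train_outputs[optimizer_idx].append(loss.detach())
        return loss

    def on_train_epoch_end(self) -> None:
        for opt_idx, outputs in self._train_outputs.items():
            ...  # post-process the outputs for each optimizer
        self._train_outputs.clear()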

@awaelchli
Member

API confusion aside, is the memory problem really an issue? I thought we are not tracking outputs if the hook is not overridden. In that sense, we already have the opt-in choice. This is the part where I don't get the full argument for removing it.

Regardless, I believe it would be interesting to see a LightningModule RNN example that implements truncated backpropagation through time manually, without the built-in TBPTT support. If we have that, we can examine the amount of boilerplate required and get a better picture of what impact this deprecation has. Perhaps I could help and try to contribute an example here.

@ananthsub
Contributor Author

ananthsub commented Aug 5, 2021

API confusion aside, is the memory problem really an issue? I thought we are not tracking outputs if the hook is not overridden. In that sense, we already have the opt-in choice. This is the part where I don't get the full argument for removing it.

@awaelchli The memory issue is the biggest risk. Though it's opt-in, the current hook order makes it very dangerous:

  • module.training_epoch_end
  • callback.on_train_epoch_end
  • module.on_train_epoch_end

If the user needs to log a value that's used in the callback's on_train_epoch_end, then they are currently forced to do so in training_epoch_end (assuming they can't use training_step with self.log and on_epoch=True). Because they implement this hook, they immediately incur a major performance hit (at best) or a training failure (because of OOMs).
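
For reference, the alternative mentioned in parentheses, where a simple reduction is enough, looks roughly like the following sketch: log from training_step with on_epoch=True and let Lightning reduce the value across the epoch, so no raw outputs are accumulated and training_epoch_end is not implemented at all.

class MyModel(LightningModule):
    def training_step(self, batch, batch_idx):
        loss = ...  # compute the loss
        # Lightning reduces this across the epoch (mean by default) and exposes it
        # to callbacks at epoch end, without accumulating the raw step outputs
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss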

I believe your second comment is meant for #8732

@carmocca
Contributor

carmocca commented Aug 6, 2021

if the user needs to log a value that's used in the callback's on train epoch end, then they currently are forced to do so in training_epoch_end

After the logger connector re-design, if you have 2 callbacks [A, B] and B reads a value off callback_metrics logged in A, it works even if they both do it in on_train_epoch_end. This is because the callback_metrics get updated dynamically:

https://github.com/PyTorchLightning/pytorch-lightning/blob/69f287eb85c36412d5d4d6541bc25f6a75a977ea/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L297

So technically you are not forced to implement training_epoch_end if your callbacks have a proper order.

@anhnht3

anhnht3 commented Aug 7, 2021

Would this affect logging of torchmetrics, i.e., automatically calling metric.compute() at the end of each epoch?

@ananthsub
Contributor Author

ananthsub commented Aug 7, 2021

@carmocca the scenario I imagine is the LightningModule logging a value that's required for callbacks [A, B] to process in on_train_epoch_end, not callbacks logging metrics for use in other callbacks. Though it's nice that the new logging supports this, I'm not sure users should need to rely on the order of callback execution.

@anhnht3 No, this would not affect logging of torchmetrics.
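
For context, a minimal sketch (not from the thread) of the torchmetrics pattern being asked about: when the metric object itself is passed to self.log, Lightning takes care of calling compute() and resetting it at the end of the epoch, independently of the *_epoch_end hooks. The metric choice and key name here are examples.

import torchmetrics

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.train_acc = torchmetrics.Accuracy()

    def training_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        loss = ...  # compute the loss
        # update the metric state, then log the metric object itself so that
        # Lightning handles compute() and reset at epoch end
        self.train_acc(preds, y)
        self.log("train_acc", self.train_acc, on_step=False, on_epoch=True)
        return loss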

@ananthsub ananthsub added the deprecation and discussion labels Aug 20, 2021
@tchaton tchaton added this to To do in Q3-5 Aug 22, 2021
@tchaton tchaton moved this from To do to Backlog in Sprint Q3-6: 6 Sep - 17 Sep Aug 30, 2021
@daniellepintz
Contributor

What is the status of this issue? I saw it was added to a sprint in September.

@carmocca
Contributor

It's waiting for a final decision, but it's not likely to be approved given the discussions we've had.

@ananthsub
Contributor Author

Another issue from transforming the return data from steps in a way users don't expect: #9968

@ananthsub
Contributor Author

ananthsub commented Jan 21, 2022

Now that Lightning supports infinite training, these hooks introduce greater risks of OOMs: #11554, #11480
