
[RFC] Deprecate the _epoch_end hooks #8731

Closed
ananthsub opened this issue Aug 5, 2021 · 12 comments · Fixed by #16520
Labels: deprecation, design, discussion

@ananthsub
Contributor

ananthsub commented Aug 5, 2021

We are auditing the Lightning components and APIs to assess opportunities for improvements:

#7740
https://docs.google.com/document/d/1xHU7-iQSpp9KJTjI3As2EM0mfNHHr37WZYpDpwLkivA/edit#

Lightning has had some recent issues filed around these hooks:

  • training_epoch_end
  • validation_epoch_end
  • test_epoch_end
  • predict_epoch_end

Examples:

These hooks exist to accumulate step-level outputs during the epoch for post-processing at the end of the epoch. However, they do not need to be part of the core LightningModule interface: users can easily track outputs directly inside their own modules.

Asking users to do this tracking offers major benefits:

  1. We avoid API confusion: for instance, when should users implement something in training_epoch_end vs. on_train_epoch_end? This can improve the onboarding experience (one less class of hooks to learn about, and only one way to do things).
  2. This can also improve performance: if users implement training_epoch_end and don't use the outputs, the trainer needlessly accumulates results, which wastes memory and risks OOMing. This slowdown is not clearly visible to the user either, unless training fails outright, at which point it is a bad user experience.
  3. Reduced API surface area for the trainer reduces the risk of bugs like this. These bugs disproportionately hurt user trust because the control flow isn't visible to end users. Conversely, removing this class of bugs has a disproportionate benefit to user trust.
  4. The current contract makes the trainer responsible for stewardship of data it doesn't directly use. Removing this support clarifies responsibilities and simplifies the loop internals.
  5. There's less "framework magic" at play, which means more readable user code because the tracking is explicit.
  6. Because the tracking is explicit, the responsibility of testing also falls to users; in general, we should encourage users to test their own code and keep the framework itself easily testable.

Cons:

  1. (marginally) more boilerplate code in LightningModules. For instance, users would need to pay attention to resetting the accumulated outputs (unless they explicitly want to accumulate results across epochs).

Proposal

  • Deprecate training_epoch_end, validation_epoch_end, and test_epoch_end in v1.5
  • Remove these hooks entirely, and their corresponding calls in the loops in v1.7

This is how easily users can implement this in their LightningModule with the existing hooks:

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self._train_outputs = []  # <----- New

    def training_step(self, *args, **kwargs):
        ...
        output = ...
        self._train_outputs.append(output)  # <----- New
        return output

    def on_train_epoch_end(self) -> None:
        # process self._train_outputs
        self._train_outputs = []  # <----- New

So we're talking about 3 lines of code per train/val/test/predict stage. I'd argue this is minimal compared to the amount of logic that usually goes into post-processing the outputs anyway.

@PyTorchLightning/core-contributors

Originally posted by @ananthsub in #8690

@ananthsub ananthsub added the design label Aug 5, 2021
@ananthsub
Contributor Author

One thing that will emerge is the hook call order between callbacks and the LightningModule: logging something in the module's on_train_epoch_end needs to happen first for it to be usable in the callback's on_train_epoch_end hook. This would be a change compared to today's behavior, and it is required for cases like logging a value at the end of the epoch in the module and using that key/value as a monitor for callbacks like model checkpointing or early stopping.

So this will be related to #8506
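
To make the ordering concern concrete, here is a rough sketch (not from the original discussion) of the scenario described above: the module logs an epoch-level value in its own on_train_epoch_end, and a ModelCheckpoint callback monitors that key. The metric name and the _compute_epoch_score helper are made up for illustration.

from pytorch_lightning import LightningModule
from pytorch_lightning.callbacks import ModelCheckpoint

class MyModel(LightningModule):
    def on_train_epoch_end(self) -> None:
        # hypothetical helper; any epoch-level aggregation would work here
        epoch_score = self._compute_epoch_score()
        # the module's hook must run before the callbacks' hooks so that
        # this value is visible to the monitor below
        self.log("train_score", epoch_score)

# the callback monitors the key logged by the module at epoch end
checkpoint_callback = ModelCheckpoint(monitor="train_score", mode="max")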

@carmocca
Contributor

carmocca commented Aug 5, 2021

Even though I agree with your arguments, people just love "framework magic". Removing this can make sense engineering-wise and design-wise but it can break user trust.

This is how easily users can implement this in their LightningModule with the existing hooks:

This does look simple, but it does not consider the cases where multiple optimizers are available. Basically, all the code we would remove would need to be implemented by the users who need it, depending on how complex their training step is.

If we were in a 0.x version I'd entirely agree with you, but we have to consider how widely used these hooks are and the degree of headaches this change might impose on users.


Have you considered making output aggregation opt-in, requiring a LightningModule flag (similar to automatic_optimization)?

If outputs=True, then the use of training_epoch_end is the same as before. If outputs=False, then using this hook is a way of dealing with the callback-model order.
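
A purely hypothetical sketch of what that flag could look like; no such attribute exists in Lightning, and the name and semantics here only illustrate the suggestion above.

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        # hypothetical opt-in flag, by analogy with self.automatic_optimization;
        # the trainer would only accumulate step outputs when this is True
        self.outputs = True

    def training_epoch_end(self, outputs):
        # would receive the accumulated step outputs only when self.outputs is True
        ...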

@ananthsub
Contributor Author

This does look simple but does not consider the cases where multiple optimizers are available. Basically, all the code we would remove would need to be implemented by the users who need it, depending on how complex is their training step.

I think that's also the point. If we want to support more flavors of training steps, the logic for handling these outputs in the framework gets more and more complex, when it could sit locally in the user's training step.

Tracking the outputs per optimizer index and dataloader index in automatic optimization also shouldn't be significantly more work, given these are accessible from the training step.
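
As a sketch of what that might look like (not from the issue), a module with multiple optimizers under automatic optimization could key its tracked outputs by the optimizer_idx it already receives in training_step:

from collections import defaultdict

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        # one list of step outputs per optimizer
        self._train_outputs = defaultdict(list)

    def training_step(self, batch, batch_idx, optimizer_idx):
        loss = ...  # compute the loss for this optimizer
        # detach so no computation graphs are kept alive across the epoch
        self._train_outputs[optimizer_idx].append(loss.detach())
        return loss

    def on_train_epoch_end(self) -> None:
        for opt_idx, outputs in self._train_outputs.items():
            ...  # post-process the outputs for each optimizer
        self._train_outputs.clear()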

@awaelchli
Member

API confusion aside, is the memory problem really an issue? I thought we are not tracking outputs if the hook is not overridden. In that sense, we already have the opt-in choice. This is the part where I don't get the full argument for removing it.

Regardless, I believe it would be interesting to see a LightningModule RNN example that implements truncated backpropagation through time manually, without the built-in TBPTT support. If we have that, we can examine the amount of boilerplate required and get a better picture of what impact this deprecation has. Perhaps I could help and try to contribute an example here.

@ananthsub
Contributor Author

ananthsub commented Aug 5, 2021

API confusion aside, is the memory problem really an issue? I thought we are not tracking outputs if the hook is not overridden. In that sense, we already have the opt-in choice. This is the part where I don't get the full argument for removing it.

@awaelchli The memory issue is the biggest risk. Though it's opt-in, the current hook order makes it very dangerous:

  • module.training_epoch_end
  • callback.on_train_epoch_end
  • module.on_train_epoch_end

If the user needs to log a value that's used in the callback's on_train_epoch_end, then they are currently forced to do so in training_epoch_end (assuming they can't use training_step with self.log and on_epoch=True). Because they implement this hook, they immediately incur a major performance hit (at best) or a training failure (because of OOMs).
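
For reference, the alternative mentioned in parentheses, where a simple reduction is enough, looks roughly like the following sketch: log from training_step with on_epoch=True and let Lightning reduce the value across the epoch, so no raw outputs are accumulated and training_epoch_end is not implemented at all.

class MyModel(LightningModule):
    def training_step(self, batch, batch_idx):
        loss = ...  # compute the loss
        # Lightning reduces this across the epoch (mean by default) and exposes it
        # to callbacks at epoch end, without accumulating the raw step outputs
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss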

I believe your second comment is meant for #8732

@carmocca
Contributor

carmocca commented Aug 6, 2021

if the user needs to log a value that's used in the callback's on train epoch end, then they currently are forced to do so in training_epoch_end

After the logger connector re-design, if you have 2 callbacks [A, B] and B reads a value off callback_metrics logged in A, it works even if they both do it in on_train_epoch_end. This is because the callback_metrics get updated dynamically:

https://github.com/PyTorchLightning/pytorch-lightning/blob/69f287eb85c36412d5d4d6541bc25f6a75a977ea/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L297

So technically you are not forced to implement training_epoch_end if your callbacks have a proper order.

@anhnht3

anhnht3 commented Aug 7, 2021

Would this affect logging of torchmetrics, i.e., automatically calling metric.compute() at the end of each epoch?

@ananthsub
Contributor Author

ananthsub commented Aug 7, 2021

@carmocca the scenario I imagine is the LightningModule logging a value that's required for callbacks [A, B] to process in on_train_epoch_end, not callbacks logging metrics for use in other callbacks. Though it's nice that the new logging supports this, I'm not sure users should need to rely on the order of callback execution.

@anhnht3 No, this would not affect logging of torchmetrics.
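
For context, a minimal sketch (not from the thread) of the torchmetrics pattern being asked about: when the metric object itself is passed to self.log, Lightning takes care of calling compute() and resetting it at the end of the epoch, independently of the *_epoch_end hooks. The metric choice and key name here are examples.

import torchmetrics

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.train_acc = torchmetrics.Accuracy()

    def training_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        loss = ...  # compute the loss
        # update the metric state, then log the metric object itself so that
        # Lightning handles compute() and reset at epoch end
        self.train_acc(preds, y)
        self.log("train_acc", self.train_acc, on_step=False, on_epoch=True)
        return loss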

@ananthsub ananthsub added the deprecation and discussion labels Aug 20, 2021
@tchaton tchaton added this to To do in Q3-5 Aug 22, 2021
@tchaton tchaton moved this from To do to Backlog in Sprint Q3-6: 6 Sep - 17 Sep Aug 30, 2021
@daniellepintz
Contributor

What is the status of this issue? I saw it was added to a sprint in September.

@carmocca
Contributor

It's waiting for a final decision, but it's not likely to be approved given the discussions we've had.

@ananthsub
Contributor Author

Another issue from transforming the return data from steps in a way users don't expect: #9968

@ananthsub
Contributor Author

ananthsub commented Jan 21, 2022

Now that Lightning supports infinite training, these hooks introduce greater risks of OOMs: #11554, #11480
