Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

on_validation_epoch_end() invocation order #17131

Open
jchia opened this issue Mar 17, 2023 · 3 comments
Open

on_validation_epoch_end() invocation order #17131

jchia opened this issue Mar 17, 2023 · 3 comments
Labels
design Includes a design discussion feature Is an improvement or enhancement hooks Related to the hooks API
Milestone

Comments

@jchia
Copy link

jchia commented Mar 17, 2023

Bug description

In v1.9, a model's validation_epoch_end() gets called before a Callback's on_validation_epoch_end(), but in v2.0, a model's on_validation_epoch_end() gets called after a Callback's on_validation_epoch_end().

In my use case, which worked under v1.9, there is a Callback that implements on_validation_epoch_end() where it reads the validation metric from trainer.logged_metrics, which has been updated with the validation metric by the model's validation_epoch_end(). It then checks whether there has been an improvement and logs that. In v2.0, this approach no longer works as the invocation order has changed.

One workaround is to do all this in the model itself and skip the Callback. However, I prefer to do the improvement checking and logging in a Callback instead of in the model because the necessary member variables would be useless if the model is not used for training (e.g. used only for testing or inference). Also, using a callback is a cleaner and more modular approach.

For my use case, it may be helpful to be able to express the priorities of the callbacks relative to one another and to the model. There may also be other, simpler, solutions.

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @Borda @tchaton @justusschock @awaelchli @carmocca

@jchia jchia added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Mar 17, 2023
@carmocca carmocca added feature Is an improvement or enhancement design Includes a design discussion hooks Related to the hooks API and removed bug Something isn't working needs triage Waiting to be triaged by maintainers labels Mar 20, 2023
@carmocca carmocca added this to the future milestone Mar 20, 2023
@carmocca
Copy link
Member

carmocca commented Mar 21, 2023

For context, this is a limitation of the hook system. We had the same problem for the built-in callback hooks that monitor a metric (EarlyStopping and ModelCheckpoint) and we went for a hacky internal workaround to unblock ourselves: #16567

The issue above describes the same problem but for validation. The problem to address can be boiled down to this snippet:

callback_metrics = {}


class LightningModule:
    def on_validation_epoch_end(self):
        # writes the value
        callback_metrics["val_loss"] = 123


class Callback:
    def on_validation_epoch_end(self):
        # reads the value
        # raises KeyError
        print(callback_metrics["val_loss"])


class Trainer:
    def fit(self):
        ...

        # before 2.0, this was run here
        # pl_module.validation_epoch_end()

        # the callback hook is called first
        callback.on_validation_epoch_end()
        pl_module.on_validation_epoch_end()


pl_module = LightningModule()
callback = Callback()
trainer = Trainer()
trainer.fit()

I can think of various solutions (not necessarily good):

  • (a) Reverse the order of the hooks, meaning we call the pl_module first and then the callback. This would be a significant and unavoidable breaking change which we want to avoid at all costs now that 2.0 is out of the door. This will still have the same limitation, but the other way around.
  • (b) Add a Callback hook for this limitation: "callback.on_validation_epoch_end_after_pl_module"
  • (c) Introduce a mechanism so the user or callback writer can define the calling order for a specific hook.

@jzazo
Copy link

jzazo commented Jul 3, 2023

Hi. I am affected by this same issue now that I am trying to update to lightning 2.0.

For me, option a) is the one that makes more sense, because Callbacks is optional code, whereas a LightningModule hook always runs. So it is good it runs first because then you can rely on it in all callbacks. For example, if we are concatenating step outputs, we can rely on the final concatenated one, as you explain in your doc examples. It is also the behavior we have had since 1.9. The breaking change is now on version 2.0, but this change has not been motivated publicly, I think, or advertised clearly. It does not appear in the documentation either.

Option b) feels a bit of cluttering the code. Option c) is nice but would take more time to figure out how to do correctly.

Can we get the old behavior back, which in my opinion makes more sense? Thanks.

@dannyfriar
Copy link

Hi I'm also affected by this, is there a workaround?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design Includes a design discussion feature Is an improvement or enhancement hooks Related to the hooks API
Projects
None yet
Development

No branches or pull requests

5 participants