on_validation_epoch_end() invocation order #17131

jchia · 2023-03-17T17:52:12Z

Bug description

In v1.9, a model's validation_epoch_end() gets called before a Callback's on_validation_epoch_end(), but in v2.0, a model's on_validation_epoch_end() gets called after a Callback's on_validation_epoch_end().

In my use case, which worked under v1.9, there is a Callback that implements on_validation_epoch_end() where it reads the validation metric from trainer.logged_metrics, which has been updated with the validation metric by the model's validation_epoch_end(). It then checks whether there has been an improvement and logs that. In v2.0, this approach no longer works as the invocation order has changed.

One workaround is to do all this in the model itself and skip the Callback. However, I prefer to do the improvement checking and logging in a Callback instead of in the model because the necessary member variables would be useless if the model is not used for training (e.g. used only for testing or inference). Also, using a callback is a cleaner and more modular approach.

For my use case, it may be helpful to be able to express the priorities of the callbacks relative to one another and to the model. There may also be other, simpler, solutions.

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @Borda @tchaton @justusschock @awaelchli @carmocca

The text was updated successfully, but these errors were encountered:

carmocca · 2023-03-21T14:40:25Z

For context, this is a limitation of the hook system. We had the same problem for the built-in callback hooks that monitor a metric (EarlyStopping and ModelCheckpoint) and we went for a hacky internal workaround to unblock ourselves: #16567

The issue above describes the same problem but for validation. The problem to address can be boiled down to this snippet:

callback_metrics = {}


class LightningModule:
    def on_validation_epoch_end(self):
        # writes the value
        callback_metrics["val_loss"] = 123


class Callback:
    def on_validation_epoch_end(self):
        # reads the value
        # raises KeyError
        print(callback_metrics["val_loss"])


class Trainer:
    def fit(self):
        ...

        # before 2.0, this was run here
        # pl_module.validation_epoch_end()

        # the callback hook is called first
        callback.on_validation_epoch_end()
        pl_module.on_validation_epoch_end()


pl_module = LightningModule()
callback = Callback()
trainer = Trainer()
trainer.fit()

I can think of various solutions (not necessarily good):

(a) Reverse the order of the hooks, meaning we call the pl_module first and then the callback. This would be a significant and unavoidable breaking change which we want to avoid at all costs now that 2.0 is out of the door. This will still have the same limitation, but the other way around.
(b) Add a Callback hook for this limitation: "callback.on_validation_epoch_end_after_pl_module"
(c) Introduce a mechanism so the user or callback writer can define the calling order for a specific hook.

jzazo · 2023-07-03T12:20:14Z

Hi. I am affected by this same issue now that I am trying to update to lightning 2.0.

For me, option a) is the one that makes more sense, because Callbacks is optional code, whereas a LightningModule hook always runs. So it is good it runs first because then you can rely on it in all callbacks. For example, if we are concatenating step outputs, we can rely on the final concatenated one, as you explain in your doc examples. It is also the behavior we have had since 1.9. The breaking change is now on version 2.0, but this change has not been motivated publicly, I think, or advertised clearly. It does not appear in the documentation either.

Option b) feels a bit of cluttering the code. Option c) is nice but would take more time to figure out how to do correctly.

Can we get the old behavior back, which in my opinion makes more sense? Thanks.

dannyfriar · 2023-07-25T20:00:04Z

Hi I'm also affected by this, is there a workaround?

jchia added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Mar 17, 2023

carmocca added feature Is an improvement or enhancement design Includes a design discussion hooks Related to the hooks API and removed bug Something isn't working needs triage Waiting to be triaged by maintainers labels Mar 20, 2023

carmocca added this to the future milestone Mar 20, 2023

Borda added ver: 2.0.x and removed ver: 2.0.x labels May 3, 2023

carmocca mentioned this issue Jun 11, 2023

Keep User-Defined Order of Callbacks #15026

Open

awaelchli mentioned this issue Aug 10, 2023

LightningModule.on_train_batch_end executes after Callback.on_train_batch_end #18265

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

on_validation_epoch_end() invocation order #17131

on_validation_epoch_end() invocation order #17131

jchia commented Mar 17, 2023 •

edited by github-actions bot

carmocca commented Mar 21, 2023 •

edited

jzazo commented Jul 3, 2023 •

edited

dannyfriar commented Jul 25, 2023

on_validation_epoch_end() invocation order #17131

on_validation_epoch_end() invocation order #17131

Comments

jchia commented Mar 17, 2023 • edited by github-actions bot

Bug description

How to reproduce the bug

Error messages and logs

Environment

More info

carmocca commented Mar 21, 2023 • edited

jzazo commented Jul 3, 2023 • edited

dannyfriar commented Jul 25, 2023

jchia commented Mar 17, 2023 •

edited by github-actions bot

carmocca commented Mar 21, 2023 •

edited

jzazo commented Jul 3, 2023 •

edited