3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -288,6 +288,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed passing wrong strings for scheduler interval doesn't throw an error ([#5923](https://github.com/PyTorchLightning/pytorch-lightning/pull/5923))


- Fixed DDP hanging when the `ModelCheckpoint` monitor is `None` and `val_loss` is being logged ([#6004](https://github.com/PyTorchLightning/pytorch-lightning/pull/6004))


- Fixed missing `process_dataloader` call for `TPUSpawn` when in distributed mode ([#6015](https://github.com/PyTorchLightning/pytorch-lightning/pull/6015))


6 changes: 6 additions & 0 deletions pytorch_lightning/callbacks/model_checkpoint.py
@@ -554,6 +554,12 @@ def _save_top_k_checkpoints(self, trainer, pl_module, metrics):
    epoch = metrics.get("epoch")
    step = metrics.get("step")

    # when `val_loss` is being logged and no `ModelCheckpoint` monitor is provided,
    # `val_loss` or `checkpoint_on` will be selected as the monitor and needs to be
    # reduced to prevent the processes from diverging
    if self.monitor in ("val_loss", "checkpoint_on"):
Member

Shouldn't we reduce it always?

Contributor Author
@tchaton Feb 16, 2021

Metrics already perform reduction during `compute` or `self.log`.
I think the issue only happens with legacy metrics.
@awaelchli Should it always be done?
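
For context, a minimal sketch of the distinction being discussed, assuming a plain `LightningModule` (the module below is hypothetical, not from this PR): a value routed through `self.log(..., sync_dist=True)` or a metric's `compute()` is synchronized across processes before `ModelCheckpoint` reads it, while a raw per-rank tensor logged the legacy way is not.

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    # Hypothetical module used only to illustrate `sync_dist`.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # `sync_dist=True` asks Lightning to reduce the value across processes,
        # so every rank hands the same `val_loss` to ModelCheckpoint.
        # With the default `sync_dist=False`, each rank keeps its own value.
        self.log("val_loss", loss, sync_dist=True)
        return loss
```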

Contributor

I'm not sure about this. I thought the metrics that the checkpoint gets from the trainer/logger connector were already reduced? We shouldn't let the checkpoint have the responsibility to reduce metrics, or to assume how; the mean is not always correct.
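
As a small illustration of that last point (hypothetical numbers, not from the PR): a mean is only the right cross-rank reduction when the monitored value is itself an average; for a summed monitor it understates the true value.

```python
# Two ranks report a hypothetical monitor that counts validation errors.
rank_values = [12.0, 4.0]

mean_reduced = sum(rank_values) / len(rank_values)  # 8.0  -> what reduce_op="mean" yields
sum_reduced = sum(rank_values)                      # 16.0 -> the value the user intended
```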

Contributor
@SeanNaren Feb 16, 2021

Can we get a test case of this legacy metric? It will help debug and figure out why this reduction is necessary.
Loss should always be reduced, I think, so I'm a bit puzzled how this would fix an issue.

        current = trainer.training_type_plugin.reduce(current, reduce_op="mean")

    if self.check_monitor_top_k(current):
        self._update_best_and_save(current, epoch, step, trainer, pl_module, metrics)
    elif self.verbose:
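
For readers of the hunk above, a minimal sketch (not the Lightning implementation) of why the reduction prevents the hang: if each rank compares its own unreduced `val_loss` against the top-k threshold, ranks can take different branches, and the collective calls on the saving path (for example a broadcast or barrier) then wait on peers that never arrive. A mean reduction gives every rank the same value to compare. The `reduce_monitor_mean` helper below is hypothetical.

```python
import torch
import torch.distributed as dist


def reduce_monitor_mean(value: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: average the monitored value across ranks so every
    process sees the same number before the top-k comparison."""
    if dist.is_available() and dist.is_initialized():
        value = value.clone()
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        value /= dist.get_world_size()
    return value


# Example: rank 0 computes val_loss=0.40 and rank 1 computes val_loss=0.45.
# Unreduced, only rank 0 might pass `check_monitor_top_k` and enter the saving
# path, leaving rank 1 stuck at the next collective call. After the mean
# reduction both ranks compare 0.425 and take the same branch.
```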