3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -288,6 +288,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed passing wrong strings for scheduler interval doesn't throw an error ([#5923](https://github.com/PyTorchLightning/pytorch-lightning/pull/5923))


- Fixed DDP hanging when the `ModelCheckpoint` monitor is `None` and `val_loss` is being logged ([#6004](https://github.com/PyTorchLightning/pytorch-lightning/pull/6004))


- Fixed missing `process_dataloader` call for `TPUSpawn` when in distributed mode ([#6015](https://github.com/PyTorchLightning/pytorch-lightning/pull/6015))


6 changes: 6 additions & 0 deletions pytorch_lightning/callbacks/model_checkpoint.py
@@ -554,6 +554,12 @@ def _save_top_k_checkpoints(self, trainer, pl_module, metrics):
    epoch = metrics.get("epoch")
    step = metrics.get("step")

    # when `val_loss` is being logged and no `ModelCheckpoint` monitor is provided,
    # `val_loss` or `checkpoint_on` will be selected as the monitor and needs to be
    # reduced to prevent the processes from diverging
    if self.monitor in ("val_loss", "checkpoint_on"):
Member

Shouldn't we reduce it always?

Contributor Author
@tchaton Feb 16, 2021

Metrics already perform reduction during `compute` or `self.log`.
I think the issue only happens with legacy metrics.
@awaelchli Should it always be done?
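
For context, a minimal sketch of the distinction being discussed, assuming a plain `LightningModule` (the module below is hypothetical, not from this PR): a value routed through `self.log(..., sync_dist=True)` or a metric's `compute()` is synchronized across processes before `ModelCheckpoint` reads it, while a raw per-rank tensor logged the legacy way is not.

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    # Hypothetical module used only to illustrate `sync_dist`.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # `sync_dist=True` asks Lightning to reduce the value across processes,
        # so every rank hands the same `val_loss` to ModelCheckpoint.
        # With the default `sync_dist=False`, each rank keeps its own value.
        self.log("val_loss", loss, sync_dist=True)
        return loss
```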

Contributor

I'm not sure about this. I thought the metrics that the checkpoint gets from the trainer/logger connector were already reduced? We shouldn't let the checkpoint have the responsibility to reduce metrics, or to assume how; the mean is not always correct.
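
As a small illustration of that last point (hypothetical numbers, not from the PR): a mean is only the right cross-rank reduction when the monitored value is itself an average; for a summed monitor it understates the true value.

```python
# Two ranks report a hypothetical monitor that counts validation errors.
rank_values = [12.0, 4.0]

mean_reduced = sum(rank_values) / len(rank_values)  # 8.0  -> what reduce_op="mean" yields
sum_reduced = sum(rank_values)                      # 16.0 -> the value the user intended
```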

Contributor
@SeanNaren Feb 16, 2021

Can we get a test case of this legacy metric? It will help debug and figure out why this reduction is necessary.
Loss should always be reduced, I think, so I'm a bit puzzled how this would fix an issue.

        current = trainer.training_type_plugin.reduce(current, reduce_op="mean")

    if self.check_monitor_top_k(current):
        self._update_best_and_save(current, epoch, step, trainer, pl_module, metrics)
    elif self.verbose:
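
For readers of the hunk above, a minimal sketch (not the Lightning implementation) of why the reduction prevents the hang: if each rank compares its own unreduced `val_loss` against the top-k threshold, ranks can take different branches, and the collective calls on the saving path (for example a broadcast or barrier) then wait on peers that never arrive. A mean reduction gives every rank the same value to compare. The `reduce_monitor_mean` helper below is hypothetical.

```python
import torch
import torch.distributed as dist


def reduce_monitor_mean(value: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: average the monitored value across ranks so every
    process sees the same number before the top-k comparison."""
    if dist.is_available() and dist.is_initialized():
        value = value.clone()
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        value /= dist.get_world_size()
    return value


# Example: rank 0 computes val_loss=0.40 and rank 1 computes val_loss=0.45.
# Unreduced, only rank 0 might pass `check_monitor_top_k` and enter the saving
# path, leaving rank 1 stuck at the next collective call. After the mean
# reduction both ranks compare 0.425 and take the same branch.
```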