CHANGELOG.md (3 additions, 0 deletions)
@@ -277,6 +277,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Pass the `stage` argument of `Callback.{setup,teardown}` as a keyword ([#7973](https://github.com/PyTorchLightning/pytorch-lightning/pull/7973))


- Fixed move best score to device in EarlyStopping Callback ([#7959](https://github.com/PyTorchLightning/pytorch-lightning/pull/7959))
Contributor

Suggested change
- Fixed move best score to device in EarlyStopping Callback ([#7959](https://github.com/PyTorchLightning/pytorch-lightning/pull/7959))
- Fixed move best score to device in EarlyStopping Callback ([#7959](https://github.com/PyTorchLightning/pytorch-lightning/pull/7959))

As your comments suggest, this fix applies only to TPU, right?

Contributor Author

It applies to other accelerators as well.

Contributor

How would I reproduce an error on a GPU, for example?

Why can't we do, for example, this:

self.best_score.to(current.device)

Moving one tensor to the device but not the other inside the monitor_op call is only going to raise questions for readers of this code.
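
For illustration, a minimal standalone sketch of what I have in mind (hypothetical names; it assumes the monitored value lives on a CUDA device while best_score starts on CPU):

import torch

# best_score is initialised on CPU, current comes from the accelerator
best_score = torch.tensor(float("inf"))
current = torch.tensor(0.5, device="cuda:0")

# Tensor.to() returns a new tensor, so reassign (or use the result directly)
best_score = best_score.to(current.device)
improved = torch.lt(current, best_score)  # both operands are now on cuda:0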

Contributor Author

>>> import torch
>>> import numpy as np
>>> import torch_xla.core.xla_model as xm
>>> torch_inf = torch.tensor(np.Inf)
>>> value = torch.tensor(5, device=xm.xla_device())
>>> torch.lt(value, torch_inf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: torch_xla/csrc/aten_xla_bridge.cpp:69 : Check failed: xtensor
*** Begin stack trace ***
	tensorflow::CurrentStackTrace()

Similar code on CUDA devices won't throw an error.
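
For reference, a rough sketch of the analogous check on a CUDA device (assuming a GPU is available); the 0-dim CPU tensor is accepted as a scalar operand here, so no device-mismatch error is raised:

>>> import numpy as np
>>> import torch
>>> torch_inf = torch.tensor(np.Inf)
>>> value = torch.tensor(5, device="cuda:0")
>>> torch.lt(value, torch_inf)
tensor(True, device='cuda:0')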

Contributor

I got the following error (machine with 2 GPUs, using DDP):
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I also felt that self.best_score.to(current.device) was safer.



## [1.3.6] - 2021-06-15

### Fixed
pytorch_lightning/callbacks/early_stopping.py (3 additions, 3 deletions)
@@ -196,7 +196,7 @@ def _run_early_stopping_check(self, trainer) -> None:
         # when in dev debugging
         trainer.dev_debugger.track_early_stopping_history(self, current)

-        should_stop, reason = self._evalute_stopping_criteria(current)
+        should_stop, reason = self._evalute_stopping_criteria(current, trainer)

         # stop every ddp process if any world process decides to stop
         should_stop = trainer.training_type_plugin.reduce_boolean_decision(should_stop)
@@ -206,7 +206,7 @@ def _run_early_stopping_check(self, trainer) -> None:
         if reason and self.verbose:
             self._log_info(trainer, reason)

-    def _evalute_stopping_criteria(self, current: torch.Tensor) -> Tuple[bool, str]:
+    def _evalute_stopping_criteria(self, current: torch.Tensor, trainer: 'pl.Trainer') -> Tuple[bool, str]:
         should_stop = False
         reason = None
         if self.check_finite and not torch.isfinite(current):
@@ -229,7 +229,7 @@ def _evalute_stopping_criteria(self, current: torch.Tensor) -> Tuple[bool, str]:
f" {self.monitor} = {current} {self.order_dict[self.mode]} {self.divergence_threshold}."
" Signaling Trainer to stop."
)
elif self.monitor_op(current - self.min_delta, self.best_score):
elif self.monitor_op(current - self.min_delta, self.best_score.to(trainer.lightning_module.device)):
should_stop = False
reason = self._improvement_message(current)
self.best_score = current
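
A rough standalone sketch of the effect of this change (not the actual Lightning code; device stands in for trainer.lightning_module.device, and the values are made up):

import torch

# Stand-in for trainer.lightning_module.device on an accelerator run
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

best_score = torch.tensor(float("inf"))      # initialised on CPU, as in EarlyStopping
current = torch.tensor(0.5, device=device)   # monitored value produced on the accelerator
min_delta = 0.0

# With this change, best_score is moved to the module's device before the comparison,
# so monitor_op (torch.lt for mode="min") never sees mixed-device operands
improved = torch.lt(current - min_delta, best_score.to(device))
print(improved)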