@itzhakstern itzhakstern commented Oct 16, 2025

What does this PR do?

Fixes an overflow issue that occurred when running Trainer.fit with validation enabled and a large number of epochs in combination with the ThroughputMonitor callback.

The problem was caused by incorrect computation of the training duration inside the on_validation_end method.
At the end of validation, ThroughputMonitor calculates both the validation duration and the time gap between training and validation in order to exclude it from the throughput calculation.
As part of this, it attempts to determine the total training time for the epoch by summing the values in the _time array.

However, this approach is incorrect because at each step, the _time array stores the cumulative time elapsed since t0, not incremental step durations. Therefore, summing the array results in an exaggerated total.

For example:
If t0 = 0 and an epoch has 5 steps, each taking 1 second, the _time array will look like [0, 1, 2, 3, 4, 5].
Summing this array yields 15 seconds, whereas the actual total training time is only 5 seconds.
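The 5-step example above can be reproduced directly (the name `_time` mirrors the internal array only for illustration):

```python
# Each entry is the cumulative time elapsed since t0, as in the
# 5-step example: one entry at t0, then one after each 1-second step.
_time = [0, 1, 2, 3, 4, 5]

buggy_total = sum(_time)    # 15: sums overlapping cumulative values
correct_total = _time[-1]   # 5: the last cumulative value is the elapsed time

print(buggy_total, correct_total)
```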

Over a sufficiently large number of epochs, this summation error caused the accumulated time to grow without bound, eventually leading to a numeric overflow and runtime failure (as described in the linked issue).

The fix replaces the summation logic with the last element of _time, which correctly represents the total training time elapsed since t0.

In addition, ValueError: Expected the value to increase could occur in real (non-mocked) runs when validation was interleaved with training.
This happened because t0 was updated after validation, while the _time array still contained values based on the previous reference point.
This issue did not appear in the original tests, since they used mocked time.perf_counter values that always increase monotonically.

To address this, _start() now resets the internal arrays when trainer.state.fn == TrainerFn.FITTING, ensuring the throughput state is reinitialized after validation while still accumulating correctly during fitting.
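A minimal sketch of that reset behaviour, assuming a simplified state holder (this is illustrative only, not the actual ThroughputMonitor implementation; the class and method names here are hypothetical):

```python
from collections import deque


class ThroughputState:
    """Hypothetical, simplified sketch of the reset described above."""

    def __init__(self, maxlen: int = 100) -> None:
        self._time: deque = deque(maxlen=maxlen)
        self._samples: deque = deque(maxlen=maxlen)
        self._t0 = 0.0

    def start(self, is_fitting: bool, now: float) -> None:
        # Reset the reference point. When (re)starting during fitting,
        # e.g. after an interleaved validation phase, also clear the
        # arrays so stale values based on the previous t0 cannot
        # trigger "Expected the value to increase".
        self._t0 = now
        if is_fitting:
            self._time.clear()
            self._samples.clear()

    def record(self, now: float, n_samples: int) -> None:
        # Each entry is cumulative time since t0, not a per-step delta.
        self._time.append(now - self._t0)
        self._samples.append(n_samples)
```

Clearing only when fitting preserves the normal accumulation behaviour within a single phase while discarding values that would be inconsistent after the reference point moves.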

Fixes #21257

Before submitting

Running trainer.fit for many epochs with the ThroughputMonitor callback enabled previously failed.

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Oct 16, 2025
@itzhakstern itzhakstern marked this pull request as ready for review October 16, 2025 09:03
@deependujha deependujha merged commit b554e99 into Lightning-AI:master Oct 24, 2025
84 checks passed

Labels

callback: throughput pl Generic label for PyTorch Lightning package


Development

Successfully merging this pull request may close these issues.

Bug in ThroughputMonitor.on_validation_end: using sum() instead of last value can corrupt
