@itzhakstern itzhakstern commented Oct 16, 2025

What does this PR do?

Fixes an overflow issue that occurred when running Trainer.fit with validation enabled and a large number of epochs in combination with the ThroughputMonitor callback.

The problem was caused by incorrect computation of the training duration inside the on_validation_end method.
At the end of validation, ThroughputMonitor calculates both the validation duration and the time gap between training and validation in order to exclude it from the throughput calculation.
As part of this, it attempts to determine the total training time for the epoch by summing the values in the _time array.

However, this approach is incorrect because at each step, the _time array stores the cumulative time elapsed since t0, not incremental step durations. Therefore, summing the array results in an exaggerated total.

For example:
If t0 = 0 and an epoch has 5 steps, each taking 1 second, the _time array will look like [0, 1, 2, 3, 4, 5].
Summing this array yields 15 seconds, whereas the actual total training time is only 5 seconds.
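The 5-step example above can be reproduced directly (the name `_time` mirrors the internal array only for illustration):

```python
# Each entry is the cumulative time elapsed since t0, as in the
# 5-step example: one entry at t0, then one after each 1-second step.
_time = [0, 1, 2, 3, 4, 5]

buggy_total = sum(_time)    # 15: sums overlapping cumulative values
correct_total = _time[-1]   # 5: the last cumulative value is the elapsed time

print(buggy_total, correct_total)
```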

Over a sufficiently large number of epochs, this summation error caused the accumulated time to grow without bound, eventually leading to a numeric overflow and runtime failure (as described in the linked issue).

The fix replaces the summation logic with the last element of _time, which correctly represents the total training time elapsed since t0.

In addition, ValueError: Expected the value to increase could occur in real (non-mocked) runs when validation was interleaved with training.
This happened because t0 was updated after validation, while the _time array still contained values based on the previous reference point.
This issue did not appear in the original tests, since they used mocked time.perf_counter values that always increase monotonically.

To address this, _start() now resets the internal arrays when trainer.state.fn == TrainerFn.FITTING, ensuring the throughput state is reinitialized after validation while still accumulating correctly during fitting.
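A minimal sketch of that reset behaviour, assuming a simplified state holder (this is illustrative only, not the actual ThroughputMonitor implementation; the class and method names here are hypothetical):

```python
from collections import deque


class ThroughputState:
    """Hypothetical, simplified sketch of the reset described above."""

    def __init__(self, maxlen: int = 100) -> None:
        self._time: deque = deque(maxlen=maxlen)
        self._samples: deque = deque(maxlen=maxlen)
        self._t0 = 0.0

    def start(self, is_fitting: bool, now: float) -> None:
        # Reset the reference point. When (re)starting during fitting,
        # e.g. after an interleaved validation phase, also clear the
        # arrays so stale values based on the previous t0 cannot
        # trigger "Expected the value to increase".
        self._t0 = now
        if is_fitting:
            self._time.clear()
            self._samples.clear()

    def record(self, now: float, n_samples: int) -> None:
        # Each entry is cumulative time since t0, not a per-step delta.
        self._time.append(now - self._t0)
        self._samples.append(n_samples)
```

Clearing only when fitting preserves the normal accumulation behaviour within a single phase while discarding values that would be inconsistent after the reference point moves.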

Fixes #21257

Before submitting

Running trainer.fit for many epochs with the ThroughputMonitor callback enabled previously failed.

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Oct 16, 2025
@itzhakstern itzhakstern marked this pull request as ready for review October 16, 2025 09:03
@deependujha deependujha merged commit b554e99 into Lightning-AI:master Oct 24, 2025
84 checks passed

Labels

callback: throughput pl Generic label for PyTorch Lightning package


Development

Successfully merging this pull request may close these issues.

Bug in ThroughputMonitor.on_validation_end: using sum() instead of last value can corrupt
