LRs updates are called at the end of a skipped epoch #21307

LTMeyer · 2025-10-20T13:31:59Z

What does this PR do?

This PR fixes the learning rate not being updated at the end of an epoch when `on_train_batch_start` returns -1.

It postpones the existing `raise StopIteration` until after the learning rate update.
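
The ordering change can be illustrated with a minimal sketch in plain Python. This is not the actual loop code: `run_epoch`, `HalvingScheduler`, and the callback parameters are hypothetical names chosen for illustration, and a plain `break` stands in for the real `raise StopIteration`.

```python
class HalvingScheduler:
    """Hypothetical epoch-level scheduler: halves the lr on every step."""

    def __init__(self, lr):
        self.lr = lr

    def step(self):
        self.lr *= 0.5


def run_epoch(batches, on_train_batch_start, training_step, step_epoch_scheduler):
    """Toy epoch loop where the early exit is deferred until after the lr update."""
    for batch_idx, batch in enumerate(batches):
        response = on_train_batch_start(batch, batch_idx)
        should_skip_rest_of_epoch = response == -1

        if not should_skip_rest_of_epoch:
            training_step(batch, batch_idx)

        # A skipped epoch treats the current batch as its last one.
        is_last_batch = should_skip_rest_of_epoch or batch_idx == len(batches) - 1
        if is_last_batch:
            step_epoch_scheduler()

        if should_skip_rest_of_epoch:
            # Exiting here, after the update, is the behaviour the PR describes.
            # Exiting before the update would leave the lr unchanged.
            break


scheduler = HalvingScheduler(lr=0.1)
trained_batches = []
for _ in range(2):  # two epochs, each skipped from batch index 1 onwards
    run_epoch(
        batches=[0, 1, 2, 3],
        on_train_batch_start=lambda batch, idx: -1 if idx == 1 else None,
        training_step=lambda batch, idx: trained_batches.append(idx),
        step_epoch_scheduler=scheduler.step,
    )
```

In this toy, only batch index 0 is trained in each epoch, yet the scheduler still steps once per epoch.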

Before submitting

Was this discussed/agreed via a GitHub issue? (not for typos and docs): The related issue, "LR is not updated when on_train_batch_start returns -1" (#21296), hasn't been discussed yet.
Did you read the contributor guideline, Pull Request section? Yes
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary) No
Did you write any new necessary tests? (not for typos and docs) Yes
Did you verify new and existing tests pass locally with your changes? Yes
Did you list all the breaking changes introduced by this pull request? None
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors) Yes

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--21307.org.readthedocs.build/en/21307/

src/lightning/fabric/CHANGELOG.md

SkafteNicki

Could we please also add some testing that the changed logic then correctly updates the learning rate when response=-1?

SkafteNicki · 2025-10-23T04:40:36Z

src/lightning/pytorch/loops/training_epoch_loop.py

+            should_skip_rest_of_epoch = response == -1
+            # Signal this is the last batch for the current epoch
+            if should_skip_rest_of_epoch:
+                self.batch_progress.increment_by(0, is_last_batch=True)


What is the logic behind changing from self.batch_progress.increment_processed() to self.batch_progress.increment_by(0, is_last_batch=True)?

batch_progress.increment_by is the only method that can set is_last_batch to True, which is required to trigger the lr update in the case of an IterableDataset.
increment_processed only increments the counters. For an IterableDataset, where the expected number of batches is not known, this may not be enough to detect that the epoch has ended.
Indeed, the lrs are later updated only if num_ready_batches_reached returns True, which it does when either epoch_finished_on_ready or is_last_batch is True.
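
The reasoning above can be sketched with a simplified model. This is an illustration under assumptions, not the library's implementation: `ToyBatchProgress` is a hypothetical stand-in for the batch progress tracker, the free function `num_ready_batches_reached` only mirrors the condition described in this comment, and an unknown dataset length is modelled as `math.inf`.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyBatchProgress:
    """Hypothetical, simplified stand-in for the loop's batch progress tracker."""

    ready: int = 0
    processed: int = 0
    is_last_batch: bool = False

    def increment_processed(self) -> None:
        # Only bumps a counter; is_last_batch is left untouched.
        self.processed += 1

    def increment_by(self, n: int, is_last_batch: bool = False) -> None:
        # Advances the counters by n and records the last-batch flag.
        self.ready += n
        self.processed += n
        self.is_last_batch = is_last_batch


def num_ready_batches_reached(progress: ToyBatchProgress, num_training_batches: float) -> bool:
    # The lr update is gated on this condition, as described above.
    epoch_finished_on_ready = progress.ready == num_training_batches
    return epoch_finished_on_ready or progress.is_last_batch


unknown_length = math.inf  # assumption: how an unsized IterableDataset is modelled here

# Old behaviour: the counter moves, but nothing marks the epoch as finished.
old = ToyBatchProgress(ready=3, processed=2)
old.increment_processed()
old_triggers_lr_update = num_ready_batches_reached(old, unknown_length)

# New behaviour: counters stay put, and the skipped batch is flagged as the last one.
new = ToyBatchProgress(ready=3, processed=3)
new.increment_by(0, is_last_batch=True)
new_triggers_lr_update = num_ready_batches_reached(new, unknown_length)
```

With an unknown length, the counter comparison can never succeed, so in this model only the `is_last_batch` flag can open the gate for the lr update.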

Could we please also add some testing that the changed logic then correctly updates the learning rate when response=-1?

I'll do it.

@SkafteNicki I've just added tests. I'm not sure about the most appropriate location, though.
Commit 10d18d9 added tests:

- to check that is_last_batch is set when the rest of the epoch is skipped.
- to check that the lr is updated at the end of the epoch when on_train_batch_start returns -1.

I've checked that these tests did fail before the changes introduced in this PR.

LTMeyer added 4 commits October 20, 2025 10:41

fix: Update lr if train_batch_start returns -1

adb8780

fix: Rename variable: should_skip_rest_of_epoch

4fa5c05

chore: Update changelog

4a713e5

fix: Batch increment

71d2ebd

LTMeyer requested review from Borda, ethanwharris, justusschock, lantiga and tchaton as code owners October 20, 2025 13:32

github-actions bot added the fabric (lightning.fabric.Fabric) and pl (Generic label for PyTorch Lightning package) labels Oct 20, 2025

SkafteNicki reviewed Oct 23, 2025

src/lightning/fabric/CHANGELOG.md Outdated

Apply suggestion from @SkafteNicki

60a7cd3

SkafteNicki reviewed Oct 23, 2025

LTMeyer added 2 commits October 23, 2025 13:55

test: Check lr is updated at the end of epoch

10d18d9

When `on_train_batch_start` returns -1, the rest of the epoch is skipped. The lr update should still happen at the end of the epoch.
- Test is_last_batch has been set correctly
- Test lr has been updated at the end of each epoch

Merge branch 'master' into 21296-fix-lr-update-on-train-epoch-skip

024933b

github-actions bot added the has conflicts label Oct 23, 2025

Merge branch 'master' into 21296-fix-lr-update-on-train-epoch-skip

a8d9f8b

github-actions bot removed the has conflicts label Oct 23, 2025

LTMeyer added 2 commits October 27, 2025 09:27

Merge branch 'master' into 21296-fix-lr-update-on-train-epoch-skip

222754d

doc: Add documentation for lr update

e115b84

LTMeyer requested a review from SkafteNicki October 27, 2025 08:30
