Make sampler length independent from consumed samples. #8576
When resuming a training job, the PyTorch Lightning trainer seems to use the data loader to determine the number of steps left in the epoch. When you kill and resume a training job partway through, the current implementation reduces the data sampler length by consumed_samples, while the trainer's current step is also restored. The number of remaining samples is therefore reduced twice, and the trainer can crash when current_step is greater than the length of the dataloader. Leaving the length unchanged seems to fix this issue, and since the sampler is an iterator, the stop_iteration (based on consumed_samples) still does what it is supposed to do.
What does this PR do ?
Leave len(sampler) as is instead of making it a function of consumed_samples. The default resumption behavior of the PyTorch Lightning trainer seems to work better with this configuration.
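As a rough illustration of the idea (not the actual NeMo sampler), here is a minimal sketch of a batch sampler whose length ignores consumed_samples while iteration still resumes from it. The class name ResumableBatchSampler and its constructor arguments are hypothetical and chosen only for this example.

```python
class ResumableBatchSampler:
    """Hypothetical sampler for illustration; not the NeMo implementation."""

    def __init__(self, total_samples, consumed_samples, batch_size):
        self.total_samples = total_samples
        self.consumed_samples = consumed_samples
        self.batch_size = batch_size

    def __len__(self):
        # Full epoch length in batches; deliberately NOT reduced by
        # consumed_samples, so it stays consistent with a restored step count.
        return (self.total_samples + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        # Iteration still starts at consumed_samples, so samples seen before
        # the restart are skipped and the iterator stops at the epoch end.
        batch = []
        for idx in range(self.consumed_samples, self.total_samples):
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch


# Resume after 40 of 100 samples were consumed, with a batch size of 10.
sampler = ResumableBatchSampler(total_samples=100, consumed_samples=40, batch_size=10)
batches = list(sampler)
```

In this sketch, a trainer restored at step 4 sees a length of 10 and expects 6 more steps, which matches the 6 batches the iterator actually yields. If the length were also reduced by the consumed samples, the trainer would compare its restored step against a length of 6 instead.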
Collection: all
Changelog
Usage
# Add a code snippet demonstrating how to use this
Jenkins CI
To run Jenkins, a NeMo User with write access must comment "jenkins" on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information