Keep batches_yielded monotonic across StatefulDataLoader resumes #58

Open
mmshad wants to merge 1 commit into main from stateful-dataloader-monotonic-resume

@mmshad (Collaborator) commented Apr 22, 2026

Summary

  • StatefulDataLoader.load_state_dict now writes to self._batches_yielded (previously read into a local and discarded).
  • __iter__ re-applies sampler.set_skip(self._batches_yielded * self.batch_size) on every call and no longer zeros _batches_yielded. The field advances monotonically within the epoch and resets only on StopIteration.
  • Combined, the second and later resumes within a single epoch stay aligned with a continuous-run ground truth.

Closes #57

Test plan

  • uv run pytest tests/integration/test_data_resumption.py -v: the new three-way resume test and all existing tests pass.
  • Full integration suite clean.

StatefulDataLoader._batches_yielded was reset to 0 on every __iter__ call
and not restored from state_dict. Combined with the sampler's single-shot
_skip (consumed inside iter() then zeroed), this made batches_yielded
mean "batches since last iter()" rather than "total batches in this
epoch." Resumes after the first one therefore silently re-read only the
middle-run delta, misaligning the data order on the second and later
preemptions within the same epoch, the common case on busy clusters.
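
The drift described above can be illustrated with a small arithmetic model. This is only a sketch of one reading of the description: it assumes the checkpointed value is what drives the skip on resume, and `resume_offsets` is a hypothetical helper, not code from this repository.

```python
def resume_offsets(batches_per_run, monotonic):
    """Return the batch offset each run starts from within one epoch.

    batches_per_run: batches consumed in each run before a preemption.
    monotonic=False models a counter meaning "batches since last iter()".
    monotonic=True models a counter meaning "total batches in this epoch".
    """
    saved = 0      # value written to the checkpoint at each preemption
    offsets = []
    for consumed in batches_per_run:
        start = saved  # assumption: the resume skips `saved` batches
        offsets.append(start)
        # The old counter restarts from zero on every iter(), so the
        # checkpoint only records the batches of the most recent run.
        saved = start + consumed if monotonic else consumed
    return offsets


print(resume_offsets([3, 2, 3], monotonic=False))  # [0, 3, 2]
print(resume_offsets([3, 2, 3], monotonic=True))   # [0, 3, 5]
```

In this model the first resume is correct either way (offset 3). The third run should start at batch 5, but the non-monotonic counter restarts it at batch 2.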

Two fixes:
- load_state_dict writes the saved value onto self._batches_yielded
  (previously read into a local and discarded).
- __iter__ re-applies the sampler skip based on the current
  _batches_yielded and no longer zeros it. The field now advances
  monotonically within the epoch and resets only on StopIteration.
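
A minimal sketch of how the two fixes might fit together, under stated assumptions: `ToySkipSampler`, `ToyStatefulLoader`, `run_with_preemptions`, and the `"batches_yielded"` state key are hypothetical stand-ins, and only the bookkeeping follows the description above. The real classes may differ in every other respect.

```python
class ToySkipSampler:
    """Hypothetical sampler with a single-shot skip."""

    def __init__(self, num_samples):
        self.num_samples = num_samples
        self._skip = 0

    def set_skip(self, n):
        self._skip = n

    def __iter__(self):
        # The skip is consumed here and then zeroed (single-shot).
        start, self._skip = self._skip, 0
        return iter(range(start, self.num_samples))


class ToyStatefulLoader:
    """Toy loader that yields lists of sample indices."""

    def __init__(self, sampler, batch_size):
        self.sampler = sampler
        self.batch_size = batch_size
        self._batches_yielded = 0

    def __iter__(self):
        # Re-apply the skip on every call; the counter is not zeroed here.
        self.sampler.set_skip(self._batches_yielded * self.batch_size)
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                self._batches_yielded += 1
                yield batch
                batch = []
        # Epoch exhausted: the only place the counter resets.
        # (A trailing partial batch is dropped in this toy version.)
        self._batches_yielded = 0

    def state_dict(self):
        return {"batches_yielded": self._batches_yielded}

    def load_state_dict(self, state):
        # Write onto the instance rather than a discarded local.
        self._batches_yielded = state["batches_yielded"]


def run_with_preemptions(cuts, num_samples=16, batch_size=2):
    """Consume cuts[i] batches per run, checkpointing between runs."""
    seen, state = [], None
    for n in cuts:
        loader = ToyStatefulLoader(ToySkipSampler(num_samples), batch_size)
        if state is not None:
            loader.load_state_dict(state)
        it = iter(loader)
        for _ in range(n):
            seen.append(next(it))
        state = loader.state_dict()
    return seen


continuous = list(ToyStatefulLoader(ToySkipSampler(16), 2))
assert run_with_preemptions([3, 2, 3]) == continuous
```

The closing assertion compares a run split into three segments against an uninterrupted pass over the same toy epoch.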

Regression test covers three-way resume within one epoch; existing
single-resume, skip-ahead, and epoch-boundary tests still pass.
@mmshad mmshad requested a review from Naeemkh April 22, 2026 01:13
@mmshad mmshad self-assigned this Apr 22, 2026
Development

Successfully merging this pull request may close these issues.

StatefulDataLoader loses alignment on second resume within an epoch
