@dariocazzani

What does this PR do?

This PR introduces a sliding window iteration mode for dataset processing as an alternative to the existing chunking approach. Sliding windows are overlapping sequences that advance one token at a time. For N tokens and a block size of B, this yields roughly N - B training samples instead of roughly N / B, and most tokens are seen at every offset within a window rather than at a single fixed one, which improves context coverage.

Details

  • Add iteration_type parameter to training config with "chunking" or "sliding" options
  • Implement SlidingWindowDataset class for creating overlapping token windows
  • Update HFDataSource to handle both iteration modes via pattern matching
  • Add comprehensive test suite covering sliding window functionality
  • Sliding mode disabled for streaming datasets due to sequential constraints
  • Bump version to 0.5.0 to reflect new feature addition

Highlights

The key difference between iteration modes:

# Chunking: Non-overlapping blocks
# Input: "0123456789" with block_size=3
# Samples: [012], [345], [678]

# Sliding: Overlapping windows (stride=1)
# Input: "0123456789" with block_size=3
# Samples: [012], [123], [234], [345], [456], [567], [678]
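
The sample lists above can be reproduced with a short sketch. The helper names `chunk_inputs` and `sliding_inputs` are hypothetical, and the code assumes that a window is only kept when one more token follows it to serve as the next-token target, which would explain why `[789]` does not appear in the sliding list.

```python
def chunk_inputs(text: str, block_size: int) -> list[str]:
    """Non-overlapping blocks; a block is kept only if a target token follows it."""
    limit = max(0, len(text) - block_size)  # exclusive bound on start positions
    return [text[i : i + block_size] for i in range(0, limit, block_size)]


def sliding_inputs(text: str, block_size: int, stride: int = 1) -> list[str]:
    """Overlapping windows that advance by `stride` positions."""
    limit = max(0, len(text) - block_size)  # exclusive bound on start positions
    return [text[i : i + block_size] for i in range(0, limit, stride)]


if __name__ == "__main__":
    data = "0123456789"
    print(chunk_inputs(data, 3))    # ['012', '345', '678']
    print(sliding_inputs(data, 3))  # ['012', '123', '234', '345', '456', '567', '678']
```

Under these assumptions the sliding mode produces `len(text) - block_size` samples, so the gain over chunking grows with the block size.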

Usage in config:

training = ScratchGPTTraining(
    iteration_type="sliding",  # or "chunking" (default)
    # ... other params
)

* Introduce iteration_type parameter ("chunking"/"sliding") to training config and data sources
* Implement SlidingWindowDataset class for overlapping token windows (sliding by 1 position)
* Update HFDataSource to support both chunking (non-overlapping) and sliding (overlapping) modes
* Add comprehensive test coverage for sliding window functionality and edge cases
* Sliding mode unavailable for streaming datasets due to sequential processing constraints
* Bump version to 0.5.0 to reflect new feature addition

@ayeganov ayeganov left a comment

LGTM!

@ayeganov ayeganov merged commit 8015094 into main Sep 24, 2025
2 checks passed
@ayeganov ayeganov deleted the feature/sliding_window branch September 24, 2025 18:13