🪟 Add sliding window iteration mode for datasets #21

dariocazzani · 2025-09-24T15:53:58Z

What does this PR do?

This PR introduces a sliding window iteration mode for dataset processing as an alternative to the existing chunking approach. Sliding windows overlap and advance one token at a time, so the same text yields many more training samples than non-overlapping chunks (7 versus 3 in the example under Highlights), and most tokens are seen at several positions within a window rather than at one fixed offset.

Details

* Add an iteration_type parameter to the training config with "chunking" or "sliding" options
* Implement a SlidingWindowDataset class for creating overlapping token windows
* Update HFDataSource to handle both iteration modes via pattern matching
* Add a comprehensive test suite covering sliding window functionality
* Disable sliding mode for streaming datasets due to sequential constraints
* Bump version to 0.5.0 to reflect the new feature

Highlights

The key difference between iteration modes:

# Chunking: Non-overlapping blocks
# Input: "0123456789" with block_size=3
# Samples: [012], [345], [678]

# Sliding: Overlapping windows (stride=1)  
# Input: "0123456789" with block_size=3
# Samples: [012], [123], [234], [345], [456], [567], [678]
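
The sample lists above can be reproduced with a few lines of Python. This is an illustrative sketch, not the PR's code: `chunk_samples` and `sliding_samples` are hypothetical helpers, and the upper bound `len(text) - block_size` is an assumption inferred from the example, where the last character appears to be reserved as the next-token target (which would explain why [789] is not listed).

```python
def chunk_samples(text: str, block_size: int) -> list[str]:
    # Non-overlapping blocks: the start index advances by block_size.
    # Assumption: one trailing token is kept back for the shifted target.
    return [
        text[start : start + block_size]
        for start in range(0, len(text) - block_size, block_size)
    ]


def sliding_samples(text: str, block_size: int) -> list[str]:
    # Overlapping windows: the start index advances by one (stride=1).
    return [
        text[start : start + block_size]
        for start in range(0, len(text) - block_size)
    ]


print(chunk_samples("0123456789", 3))    # ['012', '345', '678']
print(sliding_samples("0123456789", 3))  # ['012', '123', '234', '345', '456', '567', '678']
```

Under this assumption, N tokens give N - block_size sliding samples versus about (N - 1) / block_size chunks, so the sliding mode grows the sample count by roughly a factor of block_size on long inputs.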

Usage in config:

training = ScratchGPTTraining(
    iteration_type="sliding",  # or "chunking" (default)
    # ... other params
)

* Introduce iteration_type parameter ("chunking"/"sliding") to training config and data sources
* Implement SlidingWindowDataset class for overlapping token windows (sliding by 1 position)
* Update HFDataSource to support both chunking (non-overlapping) and sliding (overlapping) modes
* Add comprehensive test coverage for sliding window functionality and edge cases
* Sliding mode unavailable for streaming datasets due to sequential processing constraints
* Bump version to 0.5.0 to reflect new feature addition

ayeganov

LGTM!

dariocazzani requested a review from ayeganov September 24, 2025 15:54

ayeganov approved these changes Sep 24, 2025

ayeganov merged commit 8015094 into main Sep 24, 2025
2 checks passed

ayeganov deleted the feature/sliding_window branch September 24, 2025 18:13
