feat(streaming): enable per-dataset batch-sizes in CombinedStreamingDataset
#635
Conversation
Hi @MagellaX, thanks for contributing to LitData! 🙌
Hey @MagellaX, would you mind fixing the failing tests?
Codecov Report
Attention: Patch coverage is
@@ Coverage Diff @@
##            main    #635   +/- ##
===================================
  Coverage     83%     83%
===================================
  Files         49      49
  Lines       6785    6812    +27
===================================
+ Hits        5662    5688    +26
- Misses      1123    1124     +1
Hi @MagellaX. Also, I'm reviewing the PR now 😊.
Looking good so far! Added a few thoughts and queries.
Hey @tchaton @bhimrazy @lantiga @deependujha, can you merge this PR? Everything is okay now!
Closes #327.
What does this PR do?
New capability – different batch-sizes per stream
CombinedStreamingDataset.set_batch_size(...) now accepts either:
- a single int (old behaviour), or
- a Sequence[int] with one entry per wrapped StreamingDataset.
_CombinedDatasetIterator (for batching_method="per_stream") keeps yielding from the
selected stream until its own batch-size quota is reached, then switches to a new stream.
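The switching behaviour described above can be sketched as a standalone generator. This is a minimal illustration of the idea, not litdata's actual `_CombinedDatasetIterator`; the function name and signature are hypothetical.

```python
import random
from typing import Iterator, Sequence


def per_stream_iterator(
    streams: Sequence[Iterator],
    batch_sizes: Sequence[int],
    seed: int = 42,
) -> Iterator:
    """Yield from one stream until its batch-size quota is met, then switch."""
    rng = random.Random(seed)
    active = list(range(len(streams)))  # indices of non-exhausted streams
    while active:
        idx = rng.choice(active)        # pick the next stream to draw from
        for _ in range(batch_sizes[idx]):  # emit up to this stream's quota
            try:
                yield next(streams[idx])
            except StopIteration:
                active.remove(idx)      # stream exhausted; stop selecting it
                break


# Two toy "streams" with per-dataset batch sizes 2 and 4.
samples = list(
    per_stream_iterator([iter(range(0, 4)), iter(range(10, 14))], [2, 4])
)
```

Each pass through the inner loop emits at most one stream's quota before control returns to the selection step, which is the per-stream batching contract.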
Implementation highlights
Core logic lives in:
- src/litdata/utilities/base.py – validation & propagation of a Sequence[int].
- src/litdata/streaming/combined.py – dataset-specific limit calculation and switching.
Single-int paths are untouched → 100% backward-compatible.
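The validation-and-propagation step might look roughly like the helper below. This is an illustrative sketch, not the code in src/litdata/utilities/base.py; the function name is made up for the example.

```python
from typing import List, Sequence, Union


def normalize_batch_sizes(
    batch_size: Union[int, Sequence[int]], num_datasets: int
) -> List[int]:
    """Expand a single int, or validate one batch size per wrapped dataset."""
    if isinstance(batch_size, int):
        # Old single-int behaviour: every dataset shares one batch size.
        return [batch_size] * num_datasets
    sizes = list(batch_size)
    if len(sizes) != num_datasets:
        raise ValueError(
            f"Expected {num_datasets} batch sizes, got {len(sizes)}."
        )
    if not all(isinstance(s, int) and s > 0 for s in sizes):
        raise ValueError("All batch sizes must be positive integers.")
    return sizes
```

Normalizing to a list up front lets the iterator treat both call styles uniformly, which is one way the single-int path stays backward-compatible.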
Tests
Added test_combined_dataset_per_dataset_batch_size in tests/streaming/test_combined.py.
Parametrised with multiple batch-size lists; asserts that no stream emits more samples than its allotment.
All existing tests still pass (pytest -q).
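The invariant the new test asserts — no stream contributes more consecutive samples than its allotment — can be checked with a small standalone helper like this (illustrative only; the real test lives in tests/streaming/test_combined.py):

```python
from itertools import groupby
from typing import List, Sequence


def max_run_lengths(stream_ids: Sequence[int], num_streams: int) -> List[int]:
    """Longest run of consecutive samples emitted by each stream."""
    runs = [0] * num_streams
    for sid, group in groupby(stream_ids):
        runs[sid] = max(runs[sid], len(list(group)))
    return runs


# Example: stream 0 has quota 2, stream 1 has quota 3.
ids = [0, 0, 1, 1, 1, 0, 0, 1]
assert all(r <= q for r, q in zip(max_run_lengths(ids, 2), [2, 3]))
```

Tagging each emitted sample with its source-stream index makes this check a one-liner over the combined output.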
Why is this useful?
Projects combining datasets with very different tensor sizes can now maximise GPU utilisation by
using a smaller batch on the “large” dataset and a larger one on the “small” dataset, without
resorting to gradient-accumulation work-arounds.