Skip to content

Conversation

@MagellaX
Copy link
Contributor

Closes #327.
What does this PR do?

New capability – different batch-sizes per stream
CombinedStreamingDataset.set_batch_size(...) now accepts either
a single int (old behaviour), or
a Sequence[int] with one entry per wrapped StreamingDataset.
_CombinedDatasetIterator (for batching_method="per_stream") keeps yielding from the
selected stream until its own batch-size quota is reached, then switches to a new stream.
Implementation highlights
Core logic lives in:
src/litdata/utilities/base.py – validation & propagation of a Sequence[int].
src/litdata/streaming/combined.py – dataset-specific limit calculation and switching.
Single-int paths are untouched → 100 % backward-compatible.

Tests
Added test_combined_dataset_per_dataset_batch_size in tests/streaming/test_combined.py.
Parametrised with multiple batch-size lists; asserts that no stream emits more samples than its allotment.
All existing tests still pass (pytest -q).

Why is this useful?
Projects combining datasets with very different tensor sizes can now maximise GPU utilisation by
using a smaller batch on the “large” dataset and a larger one on the “small” dataset, without
resorting to gradient-accumulation work-arounds.

@deependujha
Copy link
Collaborator

Hi @MagellaX

Thanks for contributing to LitData! 🙌
Could you please take a look at the failing checks and fix the issues? Let me know if you need any help.

@tchaton
Copy link
Collaborator

tchaton commented Jul 10, 2025

Hey @MagellaX. Would you mind to fix the failing tests ?

@MagellaX MagellaX requested a review from Borda as a code owner July 10, 2025 10:09
@MagellaX MagellaX force-pushed the feat/combinedstreaming-batchsize branch from ac9ae0b to 437d6a2 Compare July 10, 2025 10:16
@MagellaX MagellaX force-pushed the feat/combinedstreaming-batchsize branch from 45abd8a to 09039a1 Compare July 10, 2025 18:39
@codecov
Copy link

codecov bot commented Jul 10, 2025

Codecov Report

Attention: Patch coverage is 92.10526% with 3 lines in your changes missing coverage. Please review.

Project coverage is 83%. Comparing base (fc59c8a) to head (ad95aca).
Report is 1 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #635   +/-   ##
===================================
  Coverage    83%    83%           
===================================
  Files        49     49           
  Lines      6785   6812   +27     
===================================
+ Hits       5662   5688   +26     
- Misses     1123   1124    +1     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@deependujha deependujha force-pushed the feat/combinedstreaming-batchsize branch from ae017e6 to 36fc097 Compare July 11, 2025 05:17
@MagellaX MagellaX force-pushed the feat/combinedstreaming-batchsize branch from 36fc097 to 8eca9f5 Compare July 11, 2025 06:52
@bhimrazy
Copy link
Collaborator

Hi @MagellaX,
I've merged the latest changes from main to bring this branch up to date.
Whenever you get a chance, please pull the changes into your local branch before making any new commits to avoid conflicts.

Also, I'm reviewing the PR now 😊.

@MagellaX MagellaX force-pushed the feat/combinedstreaming-batchsize branch from 370673d to 2c9f821 Compare July 11, 2025 07:25
Copy link
Collaborator

@bhimrazy bhimrazy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far! Added a few thoughts and queries.

@MagellaX MagellaX force-pushed the feat/combinedstreaming-batchsize branch from 65a5d8d to f5c3031 Compare July 11, 2025 20:45
@MagellaX MagellaX force-pushed the feat/combinedstreaming-batchsize branch from b1a84bc to 2c7eb99 Compare July 14, 2025 16:51
@MagellaX
Copy link
Contributor Author

Hey @tchaton @bhimrazy @lantiga @deependujha

Can you merge this PR! Everything is okay now!!

@tchaton tchaton merged commit fa2020e into Lightning-AI:main Jul 16, 2025
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use different batch sizes in CombinedStreamingDataset

4 participants