feat(streaming): enable per-dataset batch-sizes in CombinedStreamingDataset
#635
Conversation
Hi @MagellaX, thanks for contributing to LitData! 🙌
Hey @MagellaX, would you mind fixing the failing tests?
Codecov Report
Attention: Patch coverage is
@@ Coverage Diff @@
##            main    #635   +/- ##
===================================
  Coverage     83%     83%
===================================
  Files         49      49
  Lines       6785    6812    +27
===================================
+ Hits        5662    5688    +26
- Misses      1123    1124     +1
Hi @MagellaX. Also, I'm reviewing the PR now 😊.
Looking good so far! Added a few thoughts and queries.
Hey @tchaton @bhimrazy @lantiga @deependujha, can you merge this PR? Everything is okay now!
Closes #327.
What does this PR do?
New capability – different batch-sizes per stream
CombinedStreamingDataset.set_batch_size(...) now accepts either:
- a single int (old behaviour), or
- a Sequence[int] with one entry per wrapped StreamingDataset.
_CombinedDatasetIterator (for batching_method="per_stream") keeps yielding from the
selected stream until its own batch-size quota is reached, then switches to a new stream.
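The switching behaviour described above can be sketched as a standalone generator. This is a minimal illustration of the idea, not litdata's actual `_CombinedDatasetIterator`; the function name and signature are hypothetical.

```python
import random
from typing import Iterator, Sequence


def per_stream_iterator(
    streams: Sequence[Iterator],
    batch_sizes: Sequence[int],
    seed: int = 42,
) -> Iterator:
    """Yield from one stream until its batch-size quota is met, then switch."""
    rng = random.Random(seed)
    active = list(range(len(streams)))  # indices of non-exhausted streams
    while active:
        idx = rng.choice(active)        # pick the next stream to draw from
        for _ in range(batch_sizes[idx]):  # emit up to this stream's quota
            try:
                yield next(streams[idx])
            except StopIteration:
                active.remove(idx)      # stream exhausted; stop selecting it
                break


# Two toy "streams" with per-dataset batch sizes 2 and 4.
samples = list(
    per_stream_iterator([iter(range(0, 4)), iter(range(10, 14))], [2, 4])
)
```

Each pass through the inner loop emits at most one stream's quota before control returns to the selection step, which is the per-stream batching contract.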
Implementation highlights
Core logic lives in:
- src/litdata/utilities/base.py – validation & propagation of a Sequence[int].
- src/litdata/streaming/combined.py – dataset-specific limit calculation and switching.
Single-int paths are untouched → 100% backward-compatible.
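The validation-and-propagation step might look roughly like the helper below. This is an illustrative sketch, not the code in src/litdata/utilities/base.py; the function name is made up for the example.

```python
from typing import List, Sequence, Union


def normalize_batch_sizes(
    batch_size: Union[int, Sequence[int]], num_datasets: int
) -> List[int]:
    """Expand a single int, or validate one batch size per wrapped dataset."""
    if isinstance(batch_size, int):
        # Old single-int behaviour: every dataset shares one batch size.
        return [batch_size] * num_datasets
    sizes = list(batch_size)
    if len(sizes) != num_datasets:
        raise ValueError(
            f"Expected {num_datasets} batch sizes, got {len(sizes)}."
        )
    if not all(isinstance(s, int) and s > 0 for s in sizes):
        raise ValueError("All batch sizes must be positive integers.")
    return sizes
```

Normalizing to a list up front lets the iterator treat both call styles uniformly, which is one way the single-int path stays backward-compatible.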
Tests
Added test_combined_dataset_per_dataset_batch_size in tests/streaming/test_combined.py.
Parametrised with multiple batch-size lists; asserts that no stream emits more samples than its allotment.
All existing tests still pass (pytest -q).
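The invariant the new test asserts — no stream contributes more consecutive samples than its allotment — can be checked with a small standalone helper like this (illustrative only; the real test lives in tests/streaming/test_combined.py):

```python
from itertools import groupby
from typing import List, Sequence


def max_run_lengths(stream_ids: Sequence[int], num_streams: int) -> List[int]:
    """Longest run of consecutive samples emitted by each stream."""
    runs = [0] * num_streams
    for sid, group in groupby(stream_ids):
        runs[sid] = max(runs[sid], len(list(group)))
    return runs


# Example: stream 0 has quota 2, stream 1 has quota 3.
ids = [0, 0, 1, 1, 1, 0, 0, 1]
assert all(r <= q for r, q in zip(max_run_lengths(ids, 2), [2, 3]))
```

Tagging each emitted sample with its source-stream index makes this check a one-liner over the combined output.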
Why is this useful?
Projects combining datasets with very different tensor sizes can now maximise GPU utilisation by
using a smaller batch on the “large” dataset and a larger one on the “small” dataset, without
resorting to gradient-accumulation work-arounds.