
Conversation

@bhimrazy
Collaborator

@bhimrazy bhimrazy commented Jun 6, 2025

What does this PR do?

Fixes #599
Also resolves #482

Fixes data loss when using optimize() with a StreamingDataLoader on Parquet data with 5 or more workers.

The solution is admittedly a bit hacky but effective: we increase the total number of virtual items and let each worker run its iterator until it is naturally exhausted, rather than relying on a rigid index range.
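The idea can be sketched in isolation. This is a hypothetical illustration of the pattern, not litdata's actual code; `drain_worker` and `virtual_total` are names invented for this sketch:

```python
def drain_worker(iterator, virtual_total):
    """Yield up to `virtual_total` items, but stop early when the
    underlying iterator runs out, instead of trusting a rigid index
    range to tell the worker how many items it owns."""
    produced = []
    for _ in range(virtual_total):
        try:
            produced.append(next(iterator))
        except StopIteration:
            break  # worker's real share is exhausted; nothing is dropped
    return produced

# A worker whose real share (7 items) is smaller than the
# over-provisioned virtual total (10) still returns all 7 items.
items = drain_worker(iter(range(7)), virtual_total=10)
assert items == list(range(7))
```

Over-provisioning `virtual_total` is safe here because exhaustion, not the index, terminates the loop; the worst case is a few wasted loop iterations per worker.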

PR review

Anyone in the community is free to review the PR once the tests have passed.
If your PR was not discussed in a GitHub issue, there is a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Jun 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83%. Comparing base (d11b1c9) to head (f6e69f9).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #616   +/-   ##
===================================
  Coverage    83%    83%           
===================================
  Files        49     49           
  Lines      6750   6756    +6     
===================================
+ Hits       5604   5636   +32     
+ Misses     1146   1120   -26     

@bhimrazy
Collaborator Author

bhimrazy commented Jun 6, 2025

Okay, so I've now got a minimal test for this PR. The test works as expected: it fails on the CI machines, reproducing the bug.
[screenshot: failing test on CI]

Now, let’s add the fixes back.

@bhimrazy
Collaborator Author

bhimrazy commented Jun 8, 2025

Okay, it looks like the test is now working as expected.
The real issue seems to be that the item counts become unequal, which confirms that an item is indeed getting skipped.

[screenshot: mismatched item counts]

Also, this only appears to happen with Parquet datasets.

@Borda
Collaborator

Borda commented Jun 11, 2025

> Okay, it looks like the test is now working as expected.

Nice, will you also add a fix for it?

@bhimrazy
Collaborator Author

> Nice, will you also add a fix for it?

Yes, @Borda, I'm still investigating the root cause. So far, I haven't found a reliable fix.

@bhimrazy bhimrazy changed the title fix: handle StopIteration in StreamingDataLoaderReader to ensure smooth operation and prevent runtime error fix: optimize with StreamingDataloader parquet data for >=5 workers Jul 3, 2025
@bhimrazy bhimrazy changed the title fix: optimize with StreamingDataloader parquet data for >=5 workers fix: optimize with StreamingDataloader (parquet data) for >=5 workers Jul 3, 2025
@bhimrazy bhimrazy changed the title fix: optimize with StreamingDataloader (parquet data) for >=5 workers fix: optimize with StreamingDataloader(parquet data) for >=5 workers Jul 3, 2025
@bhimrazy bhimrazy marked this pull request as ready for review July 3, 2025 07:10
@bhimrazy bhimrazy requested review from lantiga and tchaton as code owners July 3, 2025 07:10
@bhimrazy bhimrazy requested a review from deependujha July 3, 2025 07:12
Collaborator

@tchaton tchaton left a comment


Nice !

@tchaton tchaton merged commit 68bcbe6 into Lightning-AI:main Jul 3, 2025
35 checks passed
@bhimrazy bhimrazy deleted the fix/optimize-with-streaming-dataloader branch July 3, 2025 08:16
@bhimrazy bhimrazy restored the fix/optimize-with-streaming-dataloader branch July 3, 2025 08:17
@bhimrazy bhimrazy deleted the fix/optimize-with-streaming-dataloader branch July 3, 2025 18:56
