
Conversation

@bhimrazy
Collaborator

@bhimrazy bhimrazy commented Jun 6, 2025

What does this PR do?

Fixes #599
Also resolves #482

Fixes data loss when using optimize() with a StreamingDataLoader on Parquet data with 5 or more workers.

The solution is admittedly a bit hacky but effective: we increase the total number of virtual items and let each worker run its iterator until it is naturally exhausted, rather than relying on a rigid index range.
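The idea can be sketched in isolation. This is a hypothetical illustration of the pattern, not litdata's actual code; `drain_worker` and `virtual_total` are names invented for this sketch:

```python
def drain_worker(iterator, virtual_total):
    """Yield up to `virtual_total` items, but stop early when the
    underlying iterator runs out, instead of trusting a rigid index
    range to tell the worker how many items it owns."""
    produced = []
    for _ in range(virtual_total):
        try:
            produced.append(next(iterator))
        except StopIteration:
            break  # worker's real share is exhausted; nothing is dropped
    return produced

# A worker whose real share (7 items) is smaller than the
# over-provisioned virtual total (10) still returns all 7 items.
items = drain_worker(iter(range(7)), virtual_total=10)
assert items == list(range(7))
```

Over-provisioning `virtual_total` is safe here because exhaustion, not the index, terminates the loop; the worst case is a few wasted loop iterations per worker.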

PR review

Anyone in the community is free to review the PR once the tests have passed.
If your PR was not discussed in a GitHub issue, there is a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Jun 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83%. Comparing base (d11b1c9) to head (f6e69f9).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #616   +/-   ##
===================================
  Coverage    83%    83%           
===================================
  Files        49     49           
  Lines      6750   6756    +6     
===================================
+ Hits       5604   5636   +32     
+ Misses     1146   1120   -26     

@bhimrazy
Collaborator Author

bhimrazy commented Jun 6, 2025

Okay, so I've now got a minimal test for this PR. The test works as expected: it fails on the CI machines, reproducing the bug.
[screenshot: failing test on CI]

Now, let’s add the fixes back.

@bhimrazy
Collaborator Author

bhimrazy commented Jun 8, 2025

Okay, it looks like the test is now working as expected.
The real issue seems to be that the item counts become unequal, which confirms that an item is indeed getting skipped.

[screenshot: mismatched item counts]

Also, this only appears to happen with Parquet datasets.

@Borda
Collaborator

Borda commented Jun 11, 2025

> Okay, it looks like the test is now working as expected.

Nice, will you also add a fix for it?

@bhimrazy
Collaborator Author

> Nice, will you also add a fix for it?

Yes, @Borda, I'm still investigating the root cause. So far, I haven't found a reliable fix.

@bhimrazy bhimrazy changed the title fix: handle StopIteration in StreamingDataLoaderReader to ensure smooth operation and prevent runtime error fix: optimize with StreamingDataloader parquet data for >=5 workers Jul 3, 2025
@bhimrazy bhimrazy changed the title fix: optimize with StreamingDataloader parquet data for >=5 workers fix: optimize with StreamingDataloader (parquet data) for >=5 workers Jul 3, 2025
@bhimrazy bhimrazy changed the title fix: optimize with StreamingDataloader (parquet data) for >=5 workers fix: optimize with StreamingDataloader(parquet data) for >=5 workers Jul 3, 2025
@bhimrazy bhimrazy marked this pull request as ready for review July 3, 2025 07:10
@bhimrazy bhimrazy requested review from lantiga and tchaton as code owners July 3, 2025 07:10
@bhimrazy bhimrazy requested a review from deependujha July 3, 2025 07:12
Collaborator

@tchaton tchaton left a comment


Nice !

@tchaton tchaton merged commit 68bcbe6 into Lightning-AI:main Jul 3, 2025
35 checks passed
@bhimrazy bhimrazy deleted the fix/optimize-with-streaming-dataloader branch July 3, 2025 08:16
@bhimrazy bhimrazy restored the fix/optimize-with-streaming-dataloader branch July 3, 2025 08:17
@bhimrazy bhimrazy deleted the fix/optimize-with-streaming-dataloader branch July 3, 2025 18:56
