Fix performance bottleneck in `train_test_split` #647

lukemerrick · 2025-07-01T21:34:02Z

Before submitting

Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs?
Did you write any new necessary tests?

What does this PR do?

Speeds up the `train_test_split` function in cases where the number of chunks is large enough that repeated membership tests against a list become painfully slow. The fix is simply to convert `dummy_subsampled_chunk_filename` into a set, so that each lookup is O(1) on average instead of a linear scan of the list.

The core bottleneck is here:

litData/src/litdata/utilities/train_test_split.py

Lines 62 to 64 in 1a3c1c1

    subsampled_chunks = [
        _org_chunk for _org_chunk in original_chunks if _org_chunk["filename"] in dummy_subsampled_chunk_filename
    ]
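To illustrate why this comprehension is slow and what the fix looks like, here is a minimal, self-contained sketch. It is not the actual litdata code: the helper names `filter_chunks_with_list` and `filter_chunks_with_set`, the `index` key, and the sample chunk filenames are hypothetical and exist only for this example. Only `original_chunks`, `dummy_subsampled_chunk_filename`, and the `"filename"` key come from the snippet above.

```python
from typing import Any


def filter_chunks_with_list(
    original_chunks: list[dict[str, Any]], subsampled_filenames: list[str]
) -> list[dict[str, Any]]:
    # Each `in` test scans the list, so the total cost grows with
    # len(original_chunks) * len(subsampled_filenames).
    return [c for c in original_chunks if c["filename"] in subsampled_filenames]


def filter_chunks_with_set(
    original_chunks: list[dict[str, Any]], subsampled_filenames: list[str]
) -> list[dict[str, Any]]:
    # Build the set once; each `in` test is then a hash lookup
    # (O(1) on average), and the order of original_chunks is preserved.
    wanted = set(subsampled_filenames)
    return [c for c in original_chunks if c["filename"] in wanted]


# Hypothetical sample data: 2000 chunks, of which every second one is kept.
original_chunks = [{"filename": f"chunk-0-{i}.bin", "index": i} for i in range(2000)]
dummy_subsampled_chunk_filename = [c["filename"] for c in original_chunks[::2]]

slow_result = filter_chunks_with_list(original_chunks, dummy_subsampled_chunk_filename)
fast_result = filter_chunks_with_set(original_chunks, dummy_subsampled_chunk_filename)
```

Both helpers return the same chunks in the same order; only the cost of the membership test differs, and the gap widens as the number of chunks grows.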

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Issue: #648

Did you have fun?

Unfortunately, finding this issue was not a fun process, since this function seemed like the last place to expect a performance regression. Thankfully, fixing it was not a lot of work!

codecov · 2025-07-02T05:14:24Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83%. Comparing base (68bcbe6) to head (18b64af).
Report is 1 commit behind head on main.

Additional details and impacted files

@@         Coverage Diff         @@
##           main   #647   +/-   ##
===================================
  Coverage    83%    83%           
===================================
  Files        49     49           
  Lines      6756   6756           
===================================
  Hits       5636   5636           
  Misses     1120   1120


deependujha

nice catch @lukemerrick. Thanks for contributing!🎉

bhimrazy

🚀

tchaton

Nice !

Fix bottleneck in membership check against list

16bf872

lukemerrick requested review from justusschock, lantiga and tchaton as code owners July 1, 2025 21:34

lukemerrick changed the title ~~Fix bottleneck in membership check against list~~ Fix performance bottleneck in membership check against list Jul 1, 2025

lukemerrick changed the title ~~Fix performance bottleneck in membership check against list~~ Fix performance bottleneck in train_test_split Jul 1, 2025

lukemerrick mentioned this pull request Jul 1, 2025

Performance bottleneck in train_test_split #648

Closed

lukemerrick closed this Jul 1, 2025

lukemerrick reopened this Jul 2, 2025

deependujha approved these changes Jul 2, 2025


deependujha enabled auto-merge (squash) July 2, 2025 07:39

bhimrazy approved these changes Jul 2, 2025


Borda approved these changes Jul 2, 2025


Merge branch 'main' into main

18b64af

tchaton approved these changes Jul 3, 2025


deependujha merged commit 1c51664 into Lightning-AI:main Jul 3, 2025
35 checks passed
