Skip to content

Conversation

@lukemerrick
Copy link
Contributor

@lukemerrick lukemerrick commented Jul 1, 2025

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Speeds up the train_test_split function in cases where the large number of chunks makes repeated membership tests against a list painfully slow. The fix is simply to convert dummy_subsampled_chunk_filename into a set for O(1) lookup complexity.

The core bottleneck is here:

subsampled_chunks = [
_org_chunk for _org_chunk in original_chunks if _org_chunk["filename"] in dummy_subsampled_chunk_filename
]

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Issue: #648

Did you have fun?

Unfortunately finding this issue was not a fun process, since it seemed like this function is the last place to expect a performance regression, but fixing it was thankfully not a lot of work!

@lukemerrick lukemerrick changed the title Fix bottleneck in membership check against list Fix performance bottleneck in membership check against list Jul 1, 2025
@lukemerrick lukemerrick changed the title Fix performance bottleneck in membership check against list Fix performance bottleneck in train_test_split Jul 1, 2025
@lukemerrick lukemerrick closed this Jul 1, 2025
@lukemerrick lukemerrick reopened this Jul 2, 2025
@codecov
Copy link

codecov bot commented Jul 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83%. Comparing base (68bcbe6) to head (18b64af).
Report is 1 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #647   +/-   ##
===================================
  Coverage    83%    83%           
===================================
  Files        49     49           
  Lines      6756   6756           
===================================
  Hits       5636   5636           
  Misses     1120   1120           
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@deependujha deependujha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch @lukemerrick. Thanks for contributing!🎉

@deependujha deependujha enabled auto-merge (squash) July 2, 2025 07:39
Copy link
Collaborator

@bhimrazy bhimrazy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Collaborator

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice !

@deependujha deependujha merged commit 1c51664 into Lightning-AI:main Jul 3, 2025
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants