Fix performance bottleneck in train_test_split
#647
Merged
Before submitting
What does this PR do?
Speeds up the `train_test_split` function in cases where a large number of chunks makes repeated membership tests against a `list` painfully slow. The fix is simply to convert `dummy_subsampled_chunk_filename` into a `set`, which gives average O(1) lookups instead of an O(n) scan per test.

The core bottleneck is here:
litData/src/litdata/utilities/train_test_split.py
Lines 62 to 64 in 1a3c1c1
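
As a rough illustration of the pattern described above (not the actual litdata code), here is a minimal sketch of filtering chunks by membership in a `list` versus a `set`. The function names, the `all_chunks` and `subsampled` parameters, and the sample filenames are hypothetical and chosen only for this example.

```python
def filter_chunks_with_list(all_chunks, subsampled):
    # `subsampled` is a list here, so every `in` test scans it linearly.
    # With n chunks and m subsampled filenames this is O(n * m) overall.
    return [chunk for chunk in all_chunks if chunk in subsampled]


def filter_chunks_with_set(all_chunks, subsampled):
    # Build the set once; each `in` test is then O(1) on average,
    # so the whole filter is roughly O(n + m).
    subsampled_set = set(subsampled)
    return [chunk for chunk in all_chunks if chunk in subsampled_set]


# Hypothetical data: 2000 chunk filenames, half of them selected.
chunks = [f"chunk-{i}.bin" for i in range(2000)]
selected = chunks[::2]

# Both versions return the same chunks in the same order.
assert filter_chunks_with_list(chunks, selected) == filter_chunks_with_set(chunks, selected)
```

Both versions produce identical results; only the lookup cost differs, and the gap grows with the number of chunks.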
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Issue: #648
Did you have fun?
Unfortunately, finding this issue was not a fun process, since this function seemed like the last place to expect a performance regression, but fixing it was thankfully not a lot of work!