Conversation

@fctb12 (Collaborator) commented Sep 16, 2025

This PR speeds up dataloading when basal_mapping_strategy=batch.

  1. It speeds up dataloader initialization from about 70s to 6s. This is accomplished by making sentence creation in `_process_subset` more efficient. We were previously creating a long mask for each sentence (there were millions of sentences when grouped by batch); some upfront array sorting eliminates that per-sentence work.
  2. It speeds up dataloader iteration from 21s to 5s (on 25 batches). While investigating this slowness, I discovered that repeated, non-local accesses to the h5 file were the cause; the fix was to add an LRU cache on `fetch_gene_expression`.
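The first optimization can be sketched as follows. This is a minimal illustration, not the PR's actual code: the field names and grouping key are hypothetical, but it shows the technique of replacing one boolean mask per group (quadratic-ish over millions of sentences) with a single argsort plus contiguous slicing.

```python
import numpy as np

def group_by_batch_naive(batch_ids):
    # Slow path: builds a full-length boolean mask for every batch,
    # i.e. O(n_batches * n_cells) work.
    return {b: np.where(batch_ids == b)[0] for b in np.unique(batch_ids)}

def group_by_batch_sorted(batch_ids):
    # Fast path: one stable argsort, then each batch is a contiguous
    # slice of the sorted order -- O(n log n) total.
    order = np.argsort(batch_ids, kind="stable")
    sorted_ids = batch_ids[order]
    uniq, starts = np.unique(sorted_ids, return_index=True)
    ends = np.append(starts[1:], len(batch_ids))
    return {b: order[s:e] for b, s, e in zip(uniq, starts, ends)}

batch_ids = np.array([2, 0, 1, 0, 2, 1])
naive = group_by_batch_naive(batch_ids)
fast = group_by_batch_sorted(batch_ids)
assert all(np.array_equal(naive[b], fast[b]) for b in naive)
```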

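The second optimization is a standard `functools.lru_cache` in front of the h5 read. The sketch below uses a stand-in object in place of a real h5py dataset (the `H5Like` class and `read_row` method are made up for illustration); the point is that repeated, non-local index accesses hit the in-memory cache instead of the file.

```python
from functools import lru_cache

import numpy as np

class H5Like:
    """Stand-in for an open h5 file; counts raw reads for demonstration."""
    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read_row(self, idx):
        self.reads += 1
        return self.data[idx]

store = H5Like(np.arange(12).reshape(4, 3))

@lru_cache(maxsize=256)
def fetch_gene_expression(idx):
    # Cached fetch: each distinct row is read from the store once;
    # subsequent accesses are served from the LRU cache.
    return store.read_row(idx)

for idx in [0, 1, 0, 2, 1, 0]:
    fetch_gene_expression(idx)
assert store.reads == 3  # only the 3 distinct rows touched the store
```

One caveat with this pattern: `lru_cache` returns the same array object on every hit, so callers must not mutate the returned array in place, or they will corrupt the cached value.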
@fctb12 fctb12 changed the title Speed up batch mapping strategy Speed up batch mapping Sep 16, 2025
@abhinadduri (Collaborator) left a comment


looks great, awesome work @fctb12 ! please merge after passing tests and lint

@fctb12 fctb12 merged commit f8e3de9 into main Sep 16, 2025
4 of 8 checks passed