Conversation

@fctb12 (Collaborator) commented Sep 16, 2025

This PR speeds up dataloading when basal_mapping_strategy=batch.

  1. It speeds up dataloader initialization from about 70s to 6s. This is accomplished by making sentence creation in `_process_subset` more efficient. We were previously creating a long mask for each sentence (there were millions of sentences when grouped by batch); some upfront array sorting eliminates that per-sentence work.
  2. It speeds up dataloader iteration from 21s to 5s (on 25 batches). While investigating this slowness, I discovered that repeated, non-local accesses to the h5 file were the cause; the fix was to add an LRU cache on `fetch_gene_expression`.
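The first optimization can be sketched as follows. This is a minimal illustration, not the PR's actual code: the field names and grouping key are hypothetical, but it shows the technique of replacing one boolean mask per group (quadratic-ish over millions of sentences) with a single argsort plus contiguous slicing.

```python
import numpy as np

def group_by_batch_naive(batch_ids):
    # Slow path: builds a full-length boolean mask for every batch,
    # i.e. O(n_batches * n_cells) work.
    return {b: np.where(batch_ids == b)[0] for b in np.unique(batch_ids)}

def group_by_batch_sorted(batch_ids):
    # Fast path: one stable argsort, then each batch is a contiguous
    # slice of the sorted order -- O(n log n) total.
    order = np.argsort(batch_ids, kind="stable")
    sorted_ids = batch_ids[order]
    uniq, starts = np.unique(sorted_ids, return_index=True)
    ends = np.append(starts[1:], len(batch_ids))
    return {b: order[s:e] for b, s, e in zip(uniq, starts, ends)}

batch_ids = np.array([2, 0, 1, 0, 2, 1])
naive = group_by_batch_naive(batch_ids)
fast = group_by_batch_sorted(batch_ids)
assert all(np.array_equal(naive[b], fast[b]) for b in naive)
```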

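The second optimization is a standard `functools.lru_cache` in front of the h5 read. The sketch below uses a stand-in object in place of a real h5py dataset (the `H5Like` class and `read_row` method are made up for illustration); the point is that repeated, non-local index accesses hit the in-memory cache instead of the file.

```python
from functools import lru_cache

import numpy as np

class H5Like:
    """Stand-in for an open h5 file; counts raw reads for demonstration."""
    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read_row(self, idx):
        self.reads += 1
        return self.data[idx]

store = H5Like(np.arange(12).reshape(4, 3))

@lru_cache(maxsize=256)
def fetch_gene_expression(idx):
    # Cached fetch: each distinct row is read from the store once;
    # subsequent accesses are served from the LRU cache.
    return store.read_row(idx)

for idx in [0, 1, 0, 2, 1, 0]:
    fetch_gene_expression(idx)
assert store.reads == 3  # only the 3 distinct rows touched the store
```

One caveat with this pattern: `lru_cache` returns the same array object on every hit, so callers must not mutate the returned array in place, or they will corrupt the cached value.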
@fctb12 fctb12 changed the title Speed up batch mapping strategy Speed up batch mapping Sep 16, 2025
@abhinadduri (Collaborator) left a comment


looks great, awesome work @fctb12 ! please merge after passing tests and lint

@fctb12 fctb12 merged commit f8e3de9 into main Sep 16, 2025
4 of 8 checks passed