Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This pull request significantly improves the run time for
dedup --output-stats
. By way of an example, for a file containing ~4M read to be deduped, with ~1M 10bp UMIs observed, on the master branch running with stats takes ~17h to complete! Without stats the run time is a mere 147s. With the changes herein, the run time with stats is now 380s.Currently, we are selecting random UMIs for each position separately, taking into account the frequency with which each UMI is observed in the input BAM. Since
np.random.choice
takes essentially the same amount of time to return 100,000 elements at random (with replacement) as it does for 1, the change here is usenp.random.choice
to create an array of 100,000 random UMIs from which we select each in turn when we need a random UMI, re-creating the array when we have used each element.@IanSudbery - Would you mind reviewing this when you have a moment. The test file had to be updated since the random selection is not consistent with the previous method but this is unavoidable as far as I can see.