{ts} speed up stats #184

Merged
merged 4 commits into from Sep 13, 2017
Conversation

TomSmithCGAT
Member

This pull request significantly improves the run time for dedup --output-stats. As an example, for a file containing ~4M reads to be deduplicated, with ~1M distinct 10bp UMIs observed, running with stats on the master branch takes ~17h to complete! Without stats, the run time is a mere 147s. With the changes herein, the run time with stats is now 380s.

Currently, we select random UMIs for each position separately, taking into account the frequency with which each UMI is observed in the input BAM. Since np.random.choice takes essentially the same amount of time to return 100,000 elements at random (with replacement) as it does to return 1, the change here is to use np.random.choice to create an array of 100,000 random UMIs up front. Each time we need a random UMI we take the next element from this array, and we re-create the array once every element has been used.
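For reviewers, here is a minimal sketch of the batching idea described above. It is not the code in this PR: the class name RandomUMIBuffer, its methods, and the seed parameter are hypothetical and only illustrate drawing a frequency-weighted batch with np.random.choice and handing out one element at a time.

```python
import numpy as np


class RandomUMIBuffer:
    """Hypothetical helper: hands out frequency-weighted random UMIs
    one at a time from a pre-drawn batch."""

    def __init__(self, umis, counts, batch_size=100000, seed=None):
        self.umis = np.asarray(umis)
        counts = np.asarray(counts, dtype=float)
        # Convert observed counts into sampling probabilities.
        self.probs = counts / counts.sum()
        self.batch_size = batch_size
        self.rng = np.random.RandomState(seed)
        self._refill()

    def _refill(self):
        # One call draws the whole batch (with replacement), which costs
        # about the same as drawing a single element.
        self.batch = self.rng.choice(
            self.umis, size=self.batch_size, replace=True, p=self.probs)
        self.index = 0

    def next_umi(self):
        # Re-create the batch once every element has been used.
        if self.index >= self.batch_size:
            self._refill()
        umi = self.batch[self.index]
        self.index += 1
        return umi
```

Usage would look something like buf = RandomUMIBuffer(umis, counts) once per run, then buf.next_umi() wherever a single random UMI was drawn before. The trade-off is that the sequence of random UMIs differs from the per-position draws, so outputs are not identical to the previous method.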

@IanSudbery - Would you mind reviewing this when you have a moment? The test file had to be updated since the random selection is not consistent with the previous method, but this is unavoidable as far as I can see.
