{ts} speed up stats #184

TomSmithCGAT · 2017-09-12T11:23:13Z

This pull request significantly improves the run time of dedup --output-stats. As an example, for a file containing ~4M reads to be deduped, with ~1M 10bp UMIs observed, running with stats on the master branch takes ~17h to complete! Without stats the run time is a mere 147s. With the changes herein, the run time with stats is now 380s.

Currently, we select random UMIs for each position separately, taking into account the frequency with which each UMI is observed in the input BAM. Since np.random.choice takes essentially the same amount of time to return 100,000 random elements (with replacement) as it does to return 1, the change here is to use np.random.choice to create an array of 100,000 random UMIs, from which we take each element in turn whenever we need a random UMI, re-creating the array once every element has been used.
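To illustrate the batching idea described above, here is a minimal sketch. It is not the code in this PR: the class name RandomUMISampler, its method names and the umi_counts input (a dict of UMI to observed count) are all hypothetical, chosen only for the example.

```python
import numpy as np


class RandomUMISampler:
    """Hypothetical helper: serves random UMIs one at a time from a pre-drawn batch."""

    def __init__(self, umi_counts, batch_size=100000, seed=None):
        # umi_counts is an assumed input: dict mapping UMI -> times observed
        self.umis = list(umi_counts.keys())
        counts = np.array([umi_counts[u] for u in self.umis], dtype=float)
        # Sampling probabilities proportional to the observed frequencies
        self.probs = counts / counts.sum()
        self.batch_size = batch_size
        self.rng = np.random.RandomState(seed)
        self._refill()

    def _refill(self):
        # A single call draws the whole batch of UMI indices, with replacement
        self.batch = self.rng.choice(
            len(self.umis), size=self.batch_size, replace=True, p=self.probs)
        self.pos = 0

    def next_umi(self):
        # Re-create the batch only once every element has been used
        if self.pos == self.batch_size:
            self._refill()
        umi = self.umis[self.batch[self.pos]]
        self.pos += 1
        return umi


if __name__ == "__main__":
    sampler = RandomUMISampler({"AAAA": 5, "CCCC": 3, "GGGG": 2}, seed=0)
    print([sampler.next_umi() for _ in range(5)])
```

The cost of np.random.choice is paid once per batch rather than once per UMI, so each individual draw reduces to an array lookup.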

@IanSudbery - Would you mind reviewing this when you have a moment? The test file had to be updated since the random selection is not consistent with the previous method, but this is unavoidable as far as I can see.

TomSmithCGAT added 2 commits September 12, 2017 11:25

speeds up random umi selection for stats extraction

20fbe1d

updates test files

9358893

TomSmithCGAT requested a review from IanSudbery September 12, 2017 11:23

TomSmithCGAT added 2 commits September 12, 2017 13:27

updates py2 test file

6b821bd

corrects random UMI sampling

c1cd867

TomSmithCGAT mentioned this pull request Sep 13, 2017

excessive dedup memory usage #173

Closed

TomSmithCGAT merged commit 3a3f011 into master Sep 13, 2017

TomSmithCGAT deleted the {TS}-SpeedUpStats branch October 13, 2017 10:59

SolKatzman mentioned this pull request Mar 26, 2020

excessive dedup memory usage with output-stats #409

Closed
