
Optimize some view as pairs circumstances #9

Closed
cmdcolin opened this issue Nov 29, 2018 · 6 comments

Comments

@cmdcolin
Contributor

There are some areas where use of viewAsPairs generates excessive CPU usage. The algorithm for determining which redispatch requests to make is sort of a brute-force tool: it calls getEntriesForRange for each unmatched read and then de-duplicates the results.

In one area of a long-insert-size test file, I find that this region of code https://github.com/GMOD/cram-js/blob/master/src/indexedCramFile.js#L94-L113 comes up with:

- 541,996 chunks before de-duplication
- 32 after de-duplication

This issue also applies to the bam-js code.
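For context, here is a minimal sketch of the brute-force pattern described above. This is not the actual cram-js implementation; the read fields (`mateSeq`, `matePos`), the index API shape, and the chunk key are assumptions for illustration only.

```js
// Sketch only: one getEntriesForRange call per unmatched read, followed
// by de-duplication of the resulting chunks. Names are hypothetical.
async function redispatchForUnmatchedMates(index, unmatchedReads) {
  const chunks = []
  // brute force: a separate index lookup for every unmatched read
  for (const read of unmatchedReads) {
    const entries = await index.getEntriesForRange(
      read.mateSeq,
      read.matePos,
      read.matePos + 1,
    )
    chunks.push(...entries)
  }
  // de-duplicate the collected chunks before fetching them
  const seen = new Set()
  return chunks.filter(chunk => {
    const key = `${chunk.start}-${chunk.end}` // hypothetical chunk shape
    if (seen.has(key)) return false
    seen.add(key)
    return true
  })
}
```

In a region like the one above, the loop produces hundreds of thousands of (mostly identical) chunks before the de-duplication step collapses them to a handful.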

@cmdcolin
Contributor Author

Note that this large case is uncommon; most of the time there are very few unmatched reads, and we get only about 30-70 chunks before de-duplication and 1-5 after.

@cmdcolin
Contributor Author

This happens in a sort of "fountain" region:

[screenshot of the "fountain" region, 2018-11-29]

@cmdcolin
Contributor Author

Even if I bucket-ize the getEntriesForRange requests, the returned results still amount to about 30,000 slices, roughly a gigabyte of memory usage, breaking the 60 MB fetchSizeLimit.
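A sketch of one possible bucket-izing approach, merging nearby mate positions into a few larger ranges before calling getEntriesForRange; the bucket size and function name are made up for illustration:

```js
// Sketch only: collapse sorted mate positions into merged ranges so that
// nearby mates share a single getEntriesForRange call. Names and the
// default bucket size are hypothetical.
function bucketizeMatePositions(positions, bucketSize = 10000) {
  const sorted = [...positions].sort((a, b) => a - b)
  const ranges = []
  for (const pos of sorted) {
    const last = ranges[ranges.length - 1]
    if (last && pos - last.end <= bucketSize) {
      // close enough to the previous bucket: extend it
      last.end = pos + 1
    } else {
      // too far away: start a new bucket
      ranges.push({ start: pos, end: pos + 1 })
    }
  }
  return ranges
}
```

This cuts the number of index lookups, but as noted above it does not help with the total volume of data the merged ranges cover.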

@rbuels
Contributor

rbuels commented Nov 30, 2018

The only thing I can think of to do would be to have some kind of hard limit that will throw an exception.
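For illustration, a sketch of that kind of guard, mirroring the fetchSizeLimit setting discussed in this thread; the chunk `size` field and the default limit are assumptions:

```js
// Sketch only: sum the estimated fetch sizes of the de-duplicated chunks
// and throw before downloading anything if a hard limit is exceeded.
function checkFetchSizeLimit(chunks, fetchSizeLimit = 60 * 1024 * 1024) {
  // `size`: estimated bytes to fetch for a chunk (hypothetical field)
  const totalSize = chunks.reduce((sum, chunk) => sum + chunk.size, 0)
  if (totalSize > fetchSizeLimit) {
    throw new Error(
      `fetch of ${totalSize} bytes exceeds fetchSizeLimit of ${fetchSizeLimit}`,
    )
  }
  return totalSize
}
```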

@cmdcolin
Contributor Author

It seems like the fetchSizeLimit might be a reasonable defense against this. There seem to be about 300,000 reads in this 700 bp region. Potentially the algorithm that I referred to above as brute force could be optimized, but I don't know how much more we can do in our case; it seems like we're doing our best (it just takes a minute to download the results).

@cmdcolin
Contributor Author

samtools view out.sorted.bam 1:44388266-44389062 | wc -l = 303130 😮
