Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Htsfree #4

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
Open

Htsfree #4

wants to merge 14 commits into from

Conversation

LTLA
Copy link
Owner

@LTLA LTLA commented Mar 31, 2019

@LTLA
Copy link
Owner Author

LTLA commented Mar 31, 2019

Well, so much for the BamFile idea. To recap, the hope was that we could open the BamFile at the start of the calling function, and pass the resulting object to the BAM file-reading functions. This would avoid the overhead of setting up the BAM file handle at every iteration of a file read.

However, this doesn't work when you ask for paired reads by chromosome, because readGAlignmentPairs will also search for mates on other chromosomes - forcing the file pointer forwards and skipping the other chromosomes entirely when the calling function loops to it .

We can get around it by open and closeing the BamFile at every per-chromosome iteration, but I wonder if this would defeat the performance benefit of using a BamFile in the first place... @mtmorgan?

@jmacdon
Copy link

jmacdon commented Apr 1, 2019 via email

@LTLA
Copy link
Owner Author

LTLA commented Apr 2, 2019

I had thought about that, but was unwilling to rely on the aligner's definition of what a proper pair was. Notwithstanding inter-chromosomal pairs, different aligners might use different metrics for defining a proper pair - the maximum allowable fragment length is one such parameter that comes to mind.

If we had to do some filtering, the INS field would be much more standard for getting rid of inter-chromosomal pairs (as well as large fragments), as mentioned in Bioconductor/GenomicAlignments#4.

@jmacdon
Copy link

jmacdon commented Apr 2, 2019 via email

@LTLA
Copy link
Owner Author

LTLA commented Apr 2, 2019

Oops! Well spotted, I got my names mixed up. I was referring to TLEN, which is known to (R)samtools as ISIZE (not entirely sure why those two have different names, but there we go).

@jmacdon
Copy link

jmacdon commented Apr 2, 2019 via email

@jmacdon
Copy link

jmacdon commented Apr 2, 2019 via email

@LTLA
Copy link
Owner Author

LTLA commented Apr 2, 2019

To be clear, my concern isn't about filtering out inter-chromosomal read pairs in memory; it's about avoiding them being read into memory at all. The current state of this PR does memory-level filtering, but it should theoretically be possible to get better performance by skipping them during I/O.

Of course, if the current PR status is sufficiently fast for your use cases, I'll merge. I can't really check until I get my new laptop - 2GB of RAM is not enough for genomics these days.

@LTLA
Copy link
Owner Author

LTLA commented Apr 5, 2019

Well... this is disappointing.

library(Rsamtools)
bf <- system.file("exdata", "rep1.bam", package="csaw")
H <- scanBamHeader(bf)[[1]]$targets
H
## chrA chrB chrC 
## 1298  870 1345 

handle <- BamFile(bf)
open(handle)

scanBam(handle, param=ScanBamParam(what="pos", which=GRanges("chrA", IRanges(1, H[1]))))
## $`chrA:1-1298`
## $`chrA:1-1298`$pos
##   [1]    3    4    6    8   10   12   12   12   13   13   17   20   21   23   26
## ... etc.

scanBam(handle, param=ScanBamParam(what="pos", which=GRanges("chrB", IRanges(1, H[2]))))
## $`chrB:1-870`
## $`chrB:1-870`$pos
## integer(0)

scanBam(handle, param=ScanBamParam(what="pos", which=GRanges("chrC", IRanges(1, H[3]))))
## $`chrC:1-1345`
## $`chrC:1-1345`$pos
## integer(0)

close(handle)

As you can see, trying to retrieve reads on an open BAM file handle that's already been searched by position... doesn't work, even if the ensuing calls refer to reads that should occur later in the file.

@mtmorgan
Copy link

mtmorgan commented Apr 5, 2019

Construct the GRanges up-front https://support.bioconductor.org/p/119631/#119632 ? I guess the costs of opening a file are parsing the index and seeking the position; opening the BamFile once and using GRanges saves the cost of index parsing. One could also probably implement aseek,BamFile method and I'm happy to do that if you open an issue on the Rsamtools repository. Or maybe it's a bug...

@jmacdon
Copy link

jmacdon commented Apr 5, 2019 via email

@jmacdon
Copy link

jmacdon commented Apr 5, 2019 via email

@jmacdon
Copy link

jmacdon commented Apr 5, 2019 via email

@mtmorgan
Copy link

mtmorgan commented Apr 5, 2019

@jmacdon can you describe what you mean at Bioconductor/Rsamtools#6 ?

@LTLA
Copy link
Owner Author

LTLA commented Apr 5, 2019

Thanks @mtmorgan, the below seems to work for me:

library(Rsamtools)
bf <- system.file("exdata", "rep1.bam", package="csaw")
H <- scanBamHeader(bf)[[1]]$targets
H
## chrA chrB chrC
## 1298  870 1345

handle <- BamFile(bf, yieldSize=1)
all.ref <- GRanges(names(H), IRanges(1, H))
param <- ScanBamParam(what="pos", which=all.ref)

open(handle)

scanBam(handle, param=param)

scanBam(handle, param=param)

scanBam(handle, param=param)

close(handle)

@jmacdon; yes, the idea would be to just swap in the existing read input functions with GenomicAlignments, without having to rewrite a whole lot of the surrounding context: while still preserving, as much as possible, the current performance characteristics. We're almost there; the performance degradation is acceptable for standard applications, it's just this scaffold case that sucks.

@jmacdon
Copy link

jmacdon commented Apr 5, 2019 via email

@LTLA
Copy link
Owner Author

LTLA commented Apr 6, 2019

@jmacdon One more roll of the dice. I've just switched windowCounts to use the scheme suggested by @mtmorgan; can you see how fast it runs on your scaffolds in single-end mode?

@jmacdon
Copy link

jmacdon commented Apr 9, 2019 via email

@jmacdon
Copy link

jmacdon commented Apr 9, 2019 via email

@LTLA
Copy link
Owner Author

LTLA commented Apr 10, 2019

... is it still running?

@jmacdon
Copy link

jmacdon commented Apr 10, 2019 via email

@jmacdon
Copy link

jmacdon commented Apr 10, 2019 via email

@jmacdon
Copy link

jmacdon commented Apr 10, 2019 via email

@jmacdon
Copy link

jmacdon commented Apr 10, 2019 via email

@LTLA
Copy link
Owner Author

LTLA commented Apr 10, 2019

Oh geez. Let me double-check my code, maybe I did something wrong.

@LTLA
Copy link
Owner Author

LTLA commented Apr 11, 2019

Well, I don't think I stuffed anything up. My small examples don't show any difference between this branch and master. The warning messages imply that windowCounts is incorrectly loading in many, many more reads, but I don't know why. Guess we'll just have to sit tight and wait for Bioconductor/Rsamtools#6.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor to use GenomicAlignments Fix discard for paired end reads
3 participants