getSeq versus Views runtime #37
Hmm, digging in a bit more: is Views quicker simply because it's not actually extracting the entire sequence of each region? When I actually get the seqs from the Views object, that's time-consuming too. Oh well!
I don't really need a DNAStringSet; a character vector would be fine. Feel free to close this, although if you have suggestions about the most efficient way to get large numbers of sequences from a BSgenome, I'd love to hear them. Thanks again! Janet
Hi Janet,
Yep, that's why. The Note that doing
In that case, it seems that
The 2nd call to Not a huge deal, and a refactoring of H. |
Thanks for taking a look, appreciate it. Check this out: with the genome I've been working with, the time difference is MUCH more pronounced than with hg38 (and this test only uses 100 regions). I suspect it's because it's a lot more fragmented than hg38 (56740 contigs versus 640 for hg38), so the speedup provided by using the 2bit file is VERY helpful in my case. Thank you!
Wait.. what? That's crazy 💫 But yeah, that's because your genome has a lot of sequences. The Note that some BSgenome objects still use this old storage:
That's because they contain IUPAC ambiguity codes that are not supported by the 2bit format:
OTOH the So yes, time to modernize this. I'm moving this up in my TODO list. Let's keep this issue open until I get to this. H. |
I know! Crazy. I'm toying with the idea of just getting rid of some of the shorter contigs in this genome (I suspect the same issue is giving me trouble outside of R as well, with GATK). I'd prefer to do that kind of filtering further downstream in my analysis if I can. Anyway, it does seem worth an update. Not urgent from my perspective now that you've told me that nice trick. Thanks again!
hey,
An observation, rather than a bug: I noticed that getSeq runs much slower than Views when extracting sequences from a BSgenome using GRanges to specify the desired regions. I'll use Views for my project, but I'm wondering if getSeq would be better if updated to use whatever code Views is using.
An example is below. With hg38 I needed a large number of windows to see the problem. With the genome I'm actually working on, the problem is also evident with a much smaller number of windows (100 windows). That one is a very draft-y mammalian genome, with many, many contigs. I forged the BSgenome package from NCBI genome GCF_014825515.1.
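The code originally attached to this post is not preserved here. As a stand-in, the following is a minimal sketch of the kind of timing comparison described, not the code from this post. The window count, window width, chromosome, and random seed are all hypothetical, and it assumes the BSgenome.Hsapiens.UCSC.hg38 package is installed.

```r
library(BSgenome.Hsapiens.UCSC.hg38)
library(GenomicRanges)

genome <- BSgenome.Hsapiens.UCSC.hg38

# Hypothetical test regions: 10000 windows of 100 bp on chr1.
# Start positions are kept well inside the chromosome.
set.seed(1)
starts <- sample.int(2e8, 10000)
gr <- GRanges("chr1", IRanges(starts, width = 100))

# Extract the sequences directly as a DNAStringSet.
t_getseq <- system.time(seqs_getseq <- getSeq(genome, gr))

# Build the views first; this step alone does not extract the sequences.
t_views <- system.time(v <- Views(genome, gr))

# Materialize the sequences from the views.
t_extract <- system.time(seqs_views <- as(v, "DNAStringSet"))

print(rbind(getSeq = t_getseq, Views = t_views, extract = t_extract)[, "elapsed"])
```

Timing the Views construction and the extraction separately matters: building the views alone is not a like-for-like comparison with getSeq, which returns the sequences themselves.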
thanks!
Janet