Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

so slowly of extract sequence by coord #29

Closed
panxiaoguang opened this issue Sep 8, 2020 · 3 comments · Fixed by #68
Closed

so slowly of extract sequence by coord #29

panxiaoguang opened this issue Sep 8, 2020 · 3 comments · Fixed by #68

Comments

@panxiaoguang
Copy link

function getseq(chrom,start,stop)
           record=open(FASTA.Reader,"Project/DataBase/hg38.fa",index="Project/DataBase/hg38.fa.fai")
           sequence(record[chrom])[start:stop]
           close(record)
end
@time getseq("chr1",2345,2356)

if use samtools faidx ,it is much faster

@jakobnissen
Copy link
Member

From a quick skim, it looks like indexing is poorly optimized. I don't have time to change it right now, but a few observations:

  • When failing to find a record at the given offset, seekrecord will transverse the entire file byte by byte multiple times. That seems excessive to me - we should just look e.g. 100 bytes back and give up if it doesn't find it.
  • seekrecord could probably be simplified a bit. Just seek 100 bytes back, load in 100 bytes, and reverse-search for the next > symbol.
  • The most important aspect here: Perhaps it would be nice to have a function to extract part of a large sequence without reading the entire sequence into memory? One of the reasons why @panxiaoguang 's usecase is slow is because FASTX loads in the entire human chromosome 1 into memory. This seems much more tricky since we can't rely on readrecord! to do this - but it would still be nice to leverage the existing Automa parser. Something to mull over, definitely.

@jakobnissen
Copy link
Member

Actually we should just leverage the "linebase" and "linewidth" information from the index to do this in O(1) time. I can make a PR to make this efficient when I have time (probably in a week or so).

@panxiaoguang
Copy link
Author

Thanks a lot, maybe refer to pyfaidx or htslib!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants