Extracting a sequence by coordinates is very slow #29

panxiaoguang · 2020-09-08T05:57:52Z

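using FASTX

function getseq(chrom, start, stop)
    # Open the indexed FASTA file; the .fai index allows lookup by sequence name.
    reader = open(FASTA.Reader, "Project/DataBase/hg38.fa", index="Project/DataBase/hg38.fa.fai")
    # Slice the requested range before closing, so that it is the return value.
    seq = sequence(reader[chrom])[start:stop]
    close(reader)
    return seq
end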

@time getseq("chr1",2345,2356)

Fetching the same region with samtools faidx is much faster.

jakobnissen · 2020-09-10T12:02:41Z

From a quick skim, it looks like indexing is poorly optimized. I don't have time to change it right now, but a few observations:

- When it fails to find a record at the given offset, seekrecord will traverse the entire file byte by byte multiple times. That seems excessive to me - we should just look e.g. 100 bytes back and give up if the record isn't found there.
- seekrecord could probably be simplified a bit. Just seek 100 bytes back, load in 100 bytes, and reverse-search for the next > symbol.
- The most important aspect here: perhaps it would be nice to have a function to extract part of a large sequence without reading the entire sequence into memory? One of the reasons why @panxiaoguang's use case is slow is that FASTX loads the entire human chromosome 1 into memory. This seems much trickier since we can't rely on readrecord! to do this - but it would still be nice to leverage the existing Automa parser. Something to mull over, definitely.
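The bounded look-back suggested above could look roughly like the sketch below. This is not the FASTX.jl seekrecord code: findheader and its window keyword are hypothetical names, and the sketch assumes a > byte only ever starts a header, which does not hold if a description line itself contains >.

```julia
# Hypothetical helper, not part of FASTX.jl.
# Search at most `window` bytes before the zero-based byte `offset` for the
# nearest '>' and return its zero-based position, or `nothing` if there is none.
function findheader(io::IO, offset::Integer; window::Integer=100)
    start = max(offset - window, 0)
    seek(io, start)
    buf = read(io, offset - start)          # bytes in [start, offset)
    i = findlast(==(UInt8('>')), buf)       # 1-based index into buf
    return i === nothing ? nothing : start + i - 1
end
```

Because only one small buffer is read, the cost is bounded by window rather than by the file size, and a miss is reported immediately instead of triggering a full scan.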

jakobnissen · 2020-09-10T12:10:20Z

Actually we should just leverage the "linebase" and "linewidth" information from the index to do this in O(1) time. I can make a PR to make this efficient when I have time (probably in a week or so).
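For illustration, here is a minimal sketch of how the index fields give a constant-time lookup. It assumes the standard samtools faidx columns (sequence length, byte offset of the first base, bases per line, bytes per line). FaiEntry, byteoffset and fetchrange are hypothetical names, not the FASTX.jl API or the code from the later PR.

```julia
# Hypothetical container for one .fai line (not the FASTX.jl index type).
struct FaiEntry
    length::Int     # number of bases in the sequence
    offset::Int     # zero-based byte offset of the first base
    linebases::Int  # bases per line
    linewidth::Int  # bytes per line, including the line terminator
end

# Zero-based byte offset of the 1-based base position `pos`.
function byteoffset(e::FaiEntry, pos::Integer)
    line, col = divrem(pos - 1, e.linebases)
    return e.offset + line * e.linewidth + col
end

# Read bases start:stop from `io` without loading the whole sequence.
function fetchrange(io::IO, e::FaiEntry, start::Integer, stop::Integer)
    1 <= start <= stop <= e.length || throw(ArgumentError("range out of bounds"))
    lo = byteoffset(e, start)
    hi = byteoffset(e, stop)
    seek(io, lo)
    raw = read(io, hi - lo + 1)
    # Drop line terminators that fall inside the requested span.
    return String(filter(b -> b != UInt8('\n') && b != UInt8('\r'), raw))
end
```

With an entry taken from the .fai file, a call such as fetchrange(io, entry, 2345, 2356) seeks straight to the first requested base and reads only the bytes covering the range, so the work depends on the length of the range rather than on the size of the chromosome.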

panxiaoguang · 2020-09-10T12:19:28Z

Thanks a lot! pyfaidx or htslib might be useful references.

jakobnissen mentioned this issue Sep 10, 2020

Add constant time fasta index seeking [NEED REVIEW] #31

Merged

jakobnissen mentioned this issue Jul 28, 2022

Rewrite for v2 #68

Merged

jakobnissen closed this as completed in #68 Aug 15, 2022
