I discovered a bug in the Windows version of Rsamtools that does not occur in the same version on Ubuntu 18.04 or CentOS 7. It may be related to #5 and #12, but the examples I have might help with solving the issue.
I'm trying to extract sequences from a large reference database of virus references found here:
ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrl*.seq.gz (about 4.2 GB in total). My current database may differ a bit from the most recent one, so let me know if you would like to have the exact same database.
I'm using the same commands on all 3 systems. Not all cases go wrong, but I have isolated a case at index number 2621935 that gives different results on Windows compared to Ubuntu and CentOS.
On Windows I get this result:
A DNAStringSet instance of length 1
width seq names
[1] 4979 CATGGGTATGGCATCAGTCCTAGATAAAGGGACTGGCAAGT...TTTAACTGTCGTATCTCTTGTCTGTGGTATACTTAGTCTAG EU851411.1
On the other two platforms I get this result (this is the correct sequence; the one returned on Windows is incorrect):
A DNAStringSet instance of length 1
width seq names
[1] 4979 TTGAACAAGTAACCAGTCGTAAG...CTTACGACTGGTTACTTGTTCAA EU851411.1
The accession numbers match, but the sequences don't...
A second hint is the last sequence in the index, which the two non-Windows platforms retrieve just fine, but for which the Windows version gives an error:
Windows:
> scanFa(gbvrl_fa, idx[length(idx)])
Error in value[[3L]](cond) :
record 1 (FN398484.1:1-903) contains invalid DNA letters
file: gbvrl.nt.fasta
Other two:
The last hint is an error claiming that the reference is not present in the current index, which happens with some sequences:
Windows:
> scanFa(gbvrl_fa, idx[1400001])
Error in value[[3L]](cond) : record 1 (KY058615.1:1-1166) failed
file: gbvrl.nt.fasta
Other two:
Sequences before around position 1,300,000 in the index seem to be retrieved correctly, while sequences after around position 1,400,000 seem to fail. I don't know if that makes sense.
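One way to test whether this boundary lines up with a file-offset limit is to look at the byte offsets stored in the .fai index. The sketch below is only a diagnostic idea, not a confirmed cause: it assumes the usual five-column, tab-separated .fai layout (name, length, offset, line bases, line width) and reports the first record whose offset no longer fits in a signed 32-bit integer. The helper names `read_fai` and `first_over_32bit` are hypothetical, and the demo records are made up.

```r
# Read a samtools-style .fai index (five tab-separated columns).
# Offsets are read as doubles because R integers are 32-bit.
read_fai <- function(path) {
  read.table(path, sep = "\t", quote = "", comment.char = "",
             col.names = c("name", "length", "offset", "linebases", "linewidth"),
             colClasses = c("character", rep("numeric", 4)),
             stringsAsFactors = FALSE)
}

# Position of the first record whose byte offset exceeds the largest
# signed 32-bit integer (2^31 - 1), or NA if every offset fits.
first_over_32bit <- function(fai) {
  hit <- which(fai$offset > 2^31 - 1)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

# Tiny self-contained demo with made-up records (not real data).
demo <- tempfile(fileext = ".fai")
writeLines(c("seqA\t100\t6\t70\t71",
             "seqB\t200\t2147483000\t70\t71",
             "seqC\t300\t2147484000\t70\t71"), demo)
fai <- read_fai(demo)
first_over_32bit(fai)
```

On the real index (assumed here to be named gbvrl.nt.fasta.fai), a returned position between roughly 1,300,000 and 1,400,000 would be consistent with an offset-size problem on Windows, though it would not prove it.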
I've checked the versions of Rsamtools and its dependencies, and they match. The md5sums of all the fasta and fasta.fai files also match.
I hope that these examples shed some light on the issues.
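For anyone who wants to repeat the checksum comparison from inside R on each platform, a minimal sketch using `tools::md5sum` from base R could look like this. The wrapper name `file_md5` is hypothetical, and the demo runs on temporary files rather than the real database.

```r
# Checksums computed inside R, so the same call works on Windows,
# Ubuntu and CentOS. Stops early if any file is missing.
file_md5 <- function(paths) {
  stopifnot(all(file.exists(paths)))
  unname(tools::md5sum(paths))
}

# Demo: two temporary files with identical content should hash the same.
a <- tempfile()
b <- tempfile()
writeLines(c(">seq1", "ACGT"), a)
file.copy(a, b)
same <- identical(file_md5(a), file_md5(b))
same
```

Running `file_md5()` on the fasta and fasta.fai files on each system and comparing the printed values gives the same check without relying on an external md5sum tool.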