Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in windows version of Rsamtools Fasta sequence retrieval #14

Open
dnieuw opened this issue Oct 16, 2019 · 0 comments
Open

Bug in windows version of Rsamtools Fasta sequence retrieval #14

dnieuw opened this issue Oct 16, 2019 · 0 comments

Comments

@dnieuw
Copy link

dnieuw commented Oct 16, 2019

I discovered a bug in the windows version of Rsamtools, which does not occur in the same version on Ubuntu 18.04 or CentOS 7. It may be related to #5 and #12, but the examples I have might help with solving the issue.

I'm trying to extract sequences from a large reference database of virus references found here:
ftp://ftp.ncbi.nlm.nih.gov/genbank/gbvrl*.seq.gz (about 4.2GB in total) (my current database may differ a bit from the most recent database though, let me know if you would like to have the exact same database).

I'm using these same commands on the 3 systems, not all cases go wrong, but I have isolated a case in idx nr 2621935 which gives different results on windows as compared to Ubuntu and CentOS.

> gbvrl_fa <- open(FaFile("gbvrl.nt.fasta"))
> idx <- scanFaIndex(gbvrl_fa)
> sequence <- scanFa(gbvrl_fa, idx[2621935])

On windows I get this result:

  A DNAStringSet instance of length 1
    width seq                                                                                   names               
[1]  4979 CATGGGTATGGCATCAGTCCTAGATAAAGGGACTGGCAAGT...TTTAACTGTCGTATCTCTTGTCTGTGGTATACTTAGTCTAG EU851411.1

On the other two platforms I get this result (which is the correct sequence, the sequence on windows is incorrect):

A DNAStringSet instance of length 1
    width seq                                                                                   names
[1]  4979 TTGAACAAGTAACCAGTCGTAAG...CTTACGACTGGTTACTTGTTCAA EU851411.1

The accession numbers match, but the sequences don't...

A second hint is the last sequence in the index, which the 2 non-windows platforms retrieve just fine, but the windows version gives an error:

Windows:

> scanFa(gbvrl_fa, idx[length(idx)])
Error in value[[3L]](cond) : 
   record 1 (FN398484.1:1-903) contains invalid DNA letters
  file: gbvrl.nt.fasta

Other two:
image

The last hint is an error claiming that the reference is not present in the current index, which happens with some sequences:

Windows:

> scanFa(gbvrl_fa, idx[1400001])
Error in value[[3L]](cond) :  record 1 (KY058615.1:1-1166) failed
  file: gbvrl.nt.fasta

Other two:
image

Sequences in the index before around 1,300,000 seem to succeed at retrieving the correct sequence and sequences after around 1,400,000 seem to fail, I don't know if that makes sense.

I've checked the Rsamtools and dependencies's versions and they match. Also the md5sum of all the fasta and fasta.fai files match.

I hope that these examples shine some light on the issues.

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] shinydashboard_0.7.1  rbokeh_0.5.0          shinyWidgets_0.4.8    shinycssloaders_0.2.0 stringr_1.4.0        
 [6] ggplot2_3.2.1         data.table_1.12.2     shiny_1.3.2           Rsamtools_2.0.3       Biostrings_2.52.0    
[11] XVector_0.24.0        GenomicRanges_1.36.1  GenomeInfoDb_1.20.0   IRanges_2.18.3        S4Vectors_0.22.0     
[16] BiocGenerics_0.30.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2             lattice_0.20-38        assertthat_0.2.1       digest_0.6.20          mime_0.7              
 [6] R6_2.4.0               evaluate_0.14          httr_1.4.1             pillar_1.4.2           zlibbioc_1.30.0       
[11] rlang_0.4.0            lazyeval_0.2.2         rstudioapi_0.10        hexbin_1.27.3          DT_0.9                
[16] rmarkdown_1.16         BiocParallel_1.18.1    htmlwidgets_1.5.1      RCurl_1.95-4.12        munsell_0.5.0         
[21] compiler_3.6.1         httpuv_1.5.1           xfun_0.10              gistr_0.4.2            pkgconfig_2.0.3       
[26] htmltools_0.3.6        tidyselect_0.2.5       tibble_2.1.3           GenomeInfoDbData_1.2.1 codetools_0.2-16      
[31] crayon_1.3.4           dplyr_0.8.3            withr_2.1.2            later_0.8.0            bitops_1.0-6          
[36] grid_3.6.1             jsonlite_1.6           xtable_1.8-4           gtable_0.3.0           magrittr_1.5          
[41] scales_1.0.0           stringi_1.4.3          pryr_0.1.4             promises_1.0.1         tools_3.6.1           
[46] glue_1.3.1             purrr_0.3.2            crosstalk_1.0.0        maps_3.3.0             yaml_2.2.0            
[51] colorspace_1.4-1       BiocManager_1.30.8     knitr_1.25

CentOS 7
image

Ubuntu 18.04
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant