ftp errors when using meta.retrieval #12

Closed
NicoleGruenheit opened this issue Apr 20, 2017 · 4 comments

Comments

@NicoleGruenheit

Hi,

First, I really like this package, thanks for putting it together!

I tried to use meta.retrieval, but if the list of genomes is quite long (e.g. all Alphaproteobacteria), it never finishes because of several FTP errors. One is that the downloaded file has length 0, although downloading the same file via FTP on the command line works fine. Another is that the FTP site suddenly stops responding.

The problem is that I then have to restart the download of all those genomes. Is it possible to catch those errors from getGenome, download all the genomes that work, and then go back to the ones that didn't and retry those once? If they still don't work, an error message listing all the failed ones, including their FTP sites, would let the user download them manually.
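The two-pass idea described above can be sketched in R as follows. This is only an illustration under stated assumptions: `download_with_retry` and its `download_fun` argument are hypothetical names, not part of biomartr, and `download_fun` stands in for whatever function fetches a single genome and signals an error on failure.

```r
# Hypothetical helper, not part of biomartr.
# `download_fun` is assumed to download one genome and to call stop()
# (or otherwise signal an error) when the download fails.
download_with_retry <- function(organisms, download_fun) {
  # Returns TRUE on success and FALSE if any error was raised.
  try_one <- function(org) {
    tryCatch({
      download_fun(org)
      TRUE
    }, error = function(e) FALSE)
  }

  # First pass: attempt every genome, remembering the ones that failed.
  ok <- vapply(organisms, try_one, logical(1))
  failed <- organisms[!ok]

  # Second pass: retry each failed genome exactly once.
  if (length(failed) > 0) {
    ok_retry <- vapply(failed, try_one, logical(1))
    failed <- failed[!ok_retry]
  }

  # Report what is still missing so it can be fetched manually.
  if (length(failed) > 0) {
    message("Could not download: ", paste(failed, collapse = ", "))
  }
  invisible(failed)
}
```

The sketch reports only organism names; listing the corresponding FTP URLs would require access to the lookup table the real downloader uses.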

Also, let's say there was some problem with the internet connection and the process was killed halfway through, would it be possible to implement a check that first reads all the files in the folder and then marks the genomes that are already there (basically deletes them from the FinalGenomes vector)?
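A minimal sketch of such a check in R, assuming each genome is stored as one file named after the organism. The function name `remaining_genomes` and the file suffix are assumptions made for illustration; the naming scheme biomartr actually uses may differ.

```r
# Hypothetical helper, not part of biomartr.
# Assumption: a genome for "Escherichia coli" is saved as
# "<folder>/Escherichia_coli_genomic.fna.gz".
remaining_genomes <- function(organisms, folder, suffix = "_genomic.fna.gz") {
  if (length(organisms) == 0) return(character(0))

  expected <- file.path(folder, paste0(gsub(" ", "_", organisms), suffix))

  # A file counts as present only if it exists and is not empty,
  # so zero-length downloads are queued again.
  present <- file.exists(expected) & file.size(expected) > 0

  organisms[!present]
}
```

Calling the downloader on `remaining_genomes(all_organisms, folder)` instead of the full vector would skip everything already on disk.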

Cheers,
Nicole

@HajkD
Member

HajkD commented Apr 20, 2017

Hi Nicole,

Thank you so much for contacting me and making me aware of the problem. I really try my best to provide a useful and fully functional tool, and it is only thanks to feedback like yours that I can improve biomartr.

I am also in contact with Akshaya Ramesh (@ARamesh123), who pointed out a similar problem to me and who really helped me a lot to troubleshoot and find the bug.

To give you a short answer to your question about re-downloading genomes that couldn't be downloaded: if you install the developer version of biomartr, this re-download functionality is now included and downloads don't start all over again. As soon as I have fixed this issue with downloading thousands of bacterial genomes, and after passing the on-boarding process to rOpenSci, I will submit the new biomartr version to CRAN.

You can download the developer version by typing:

# install the current version of biomartr on your system
source("http://bioconductor.org/biocLite.R")
biocLite("HajkD/biomartr")

Following are the problems Akshaya found:

  1. I have no problem while downloading smaller datasets, e.g. all bacterial GenBank sequences from the subgroup Cyanobacteria

  2. I still get timeouts on larger processes. Sometimes it just keeps running and stalls without any error message, and sometimes I get an error message saying:

Error in open.connection(con, open = mode) : Timeout was reached
Calls: <Anonymous> ... <Anonymous> -> curl_connection -> open -> open.connection
Execution halted

  3. The interesting thing is that I ran the command for different downloads (different subgroups) on different machines and the timeout was reached at the SAME time, so there was some miscommunication with the FTP server at one moment, leading to timeouts at the same time…

  4. The meta.retrieval function works and restarts the download where it originally dropped off, but I was still not able to download all files for larger subgroups, e.g. Bacteroidetes

When assessing the problems Akshaya pointed out to me, it seems that there could be a problem on the server side when running hundreds or thousands of access queries to NCBI. The fact that the timeout is always reached at the same time suggests that the NCBI server keeps some kind of query counter per IP address. I will look more closely at the NCBI guidelines and maybe write an email to the NCBI server maintainers.

What I can do on my side is stop the download whenever I no longer receive server feedback, so that the user can re-run the meta.retrieval function after a while and continue downloading from where they left off, just as you and Akshaya proposed.
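The stop-and-resume behaviour described above might look roughly like the following R sketch. It is not the biomartr implementation: `download_until_failure` and `download_fun` are hypothetical names, and `download_fun` is assumed to raise an error (for example a curl timeout) when the server stops responding.

```r
# Hypothetical helper, not part of biomartr.
# Works through the queue in order and stops at the first error,
# returning what was finished and what is still pending.
download_until_failure <- function(organisms, download_fun) {
  done <- character(0)
  for (org in organisms) {
    ok <- tryCatch({
      download_fun(org)
      TRUE
    }, error = function(e) {
      message("Stopping at '", org, "': ", conditionMessage(e))
      FALSE
    })
    if (!ok) break
    done <- c(done, org)
  }
  list(done = done, pending = setdiff(organisms, done))
}
```

A later call with the returned `pending` vector would then continue from the point where the server stopped answering.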

I will start working on that now and will come back to you as soon as I have found a good solution.

I apologize for the inconvenience the server timeout issue might have caused when using biomartr, but I hope that the package will be useful once this issue is solved.

Many thanks and best wishes,
Hajk

@HajkD HajkD added the bug label Apr 20, 2017
@NicoleGruenheit
Author

NicoleGruenheit commented Apr 21, 2017 via email

@HajkD
Member

HajkD commented Apr 21, 2017

Hi Nicole,

Thank you so much for your fast response :)

I just found this bit of documentation, which corresponds to your great suggestion to split the genome files into chunks of e.g. 500: https://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large
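Splitting a long list of identifiers into batches of 500 can be sketched in base R as follows. The function name `chunk_ids` and the example identifiers are made up for illustration; how each batch is then sent to NCBI is left open here.

```r
# Hypothetical helper: split a vector into consecutive chunks
# of at most `chunk_size` elements.
chunk_ids <- function(ids, chunk_size = 500) {
  if (length(ids) == 0) return(list())
  # Element i goes into chunk ceiling(i / chunk_size).
  unname(split(ids, ceiling(seq_along(ids) / chunk_size)))
}

# Example with 1200 made-up identifiers.
chunks <- chunk_ids(paste0("genome_", 1:1200))
lengths(chunks)  # 500 500 200
```

Each element of `chunks` could then be processed as one batch, with a pause between batches if the server requires it.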

I also see that, for constructing the queries to do so, NCBI requires the common names of organisms (e.g. chimpanzee) instead of the scientific name, which I find extremely impractical for meta-genome retrieval, especially when dealing with bacteria, viruses or archaea, which usually don't have common names.

Anyway, I will see what I can do :) Maybe I can also get in contact with some NCBI people to see if they can help me out with that one.

Concerning your request to include the NCBI Taxonomy for phylum-based retrieval of genomes, I think this is a great idea! Since most of my scientific projects have an evolutionary context anyway, I can put this functionality extension on my TODO list.

As a future outlook, I plan to write useful interfaces between biomartr and BLAST searches, comparative genomics tools (see e.g. orthologr), and phylogeny inference tools.

The idea is to combine my packages, such as biomartr and orthologr, for standard bioinformatics tasks (e.g. multiple sequence alignment) via the magrittr pipelining approach.

For example, multiple sequence alignments for a set of genomes can be performed by simply running:

biomartr::meta.retrieval() %>% orthologr::multi_aln() %>% ...

Or pairwise orthology inference via BLAST reciprocal best hit can then be performed by running:

biomartr::meta.retrieval() %>% orthologr::map.generator() %>% ...

Or phylogeny inference via:

biomartr::meta.retrieval() %>% phylr::tree_infer() %>% ...

Unfortunately, this is still work in progress, but along the way I am always happy to receive input for potential improvements or functionality extensions.

Thank you so much for your help and feedback, I truly appreciate it :)

I will keep you posted about the new functionalities.

Best wishes,
Hajk

@HajkD HajkD added enhancement and removed bug labels Apr 21, 2017
@HajkD
Member

HajkD commented Sep 27, 2023

Please have a look at our new software GenEra, which may be able to help here: https://github.com/josuebarrera/GenEra

Cheers,
Hajk

@HajkD HajkD closed this as completed Sep 27, 2023