ftp errors when using meta.retrieval #12

NicoleGruenheit · 2017-04-20T10:29:13Z

Hi,

First, I really like this package, thanks for putting it together!

I tried to use meta.retrieval, but if the list of genomes is quite long (e.g. all Alphaproteobacteria) it never finishes because of several FTP errors. One is that the downloaded file has length 0, although downloading the same file via FTP on the command line works fine. Another is that the FTP site suddenly stops responding.

The problem is that I then have to restart the download of all those genomes. Is it possible to catch those errors from get.Genome, download all the genomes that work, and then go back to the ones that didn't and retry those once? If they still don't work, an error message listing all the failed ones, including their FTP sites, would let the user download them manually.
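A minimal sketch of the retry behaviour requested above, in base R. It is not biomartr's implementation: `retry_downloads` and `download_one` are hypothetical names, and `download_one` stands in for whatever function downloads a single genome and signals an error on failure.

```r
# Hypothetical helper: try every item, retry the failures, report what is left.
# `download_one` is a placeholder for a single-genome download function that
# signals an error (via stop()) when the download fails; it is an assumption,
# not part of biomartr's API.
retry_downloads <- function(items, download_one, retries = 1) {
  pending <- items
  done <- character(0)
  for (attempt in seq_len(retries + 1)) {
    if (length(pending) == 0) break
    ok <- vapply(pending, function(x) {
      tryCatch({
        download_one(x)
        TRUE
      }, error = function(e) {
        # Keep going after a failure instead of aborting the whole run.
        message("Failed (attempt ", attempt, "): ", x, " - ", conditionMessage(e))
        FALSE
      })
    }, logical(1))
    done <- c(done, pending[ok])
    pending <- pending[!ok]
  }
  # `failed` can be printed so the user can fetch those genomes manually.
  list(succeeded = done, failed = pending)
}
```

The returned `failed` vector is what a final error message could list, together with the corresponding FTP paths if the caller tracks them.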

Also, let's say there was some problem with the internet connection and the process was killed halfway through, would it be possible to implement a check that first reads all the files in the folder and then marks the genomes that are already there (basically deletes them from the FinalGenomes vector)?
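The check described above could look roughly like the following base R sketch. The helper name `remaining_genomes` and the file naming scheme (organism name with underscores plus a fixed suffix) are assumptions for illustration; the files that meta.retrieval actually writes may be named differently.

```r
# Hypothetical helper: return the organisms that still need to be downloaded.
# The naming scheme (<organism>_genomic.fna.gz, spaces replaced by underscores)
# is an assumption for illustration only.
remaining_genomes <- function(organisms, folder, suffix = "_genomic.fna.gz") {
  expected <- file.path(folder, paste0(gsub(" ", "_", organisms, fixed = TRUE), suffix))
  size <- file.size(expected)          # NA for files that do not exist
  present <- !is.na(size) & size > 0   # zero-length files count as missing
  organisms[!present]
}
```

Treating zero-length files as missing also covers the case where a download left an empty file behind.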

Cheers,
Nicole

HajkD · 2017-04-20T15:06:41Z

Hi Nicole,

Thank you so much for contacting me and making me aware of the problem. I really try my best to provide a useful and fully functional tool, and it is only thanks to feedback like yours that I can improve biomartr.

I am also in contact with Akshaya Ramesh (@ARamesh123), who pointed out a similar problem to me and who really helped me a lot to troubleshoot and find the bug.

To give you a short answer to your question concerning the re-download of genomes that couldn't be downloaded: if you install the developer version of biomartr, this re-download functionality is now included and downloads don't start all over again. As soon as I have fixed this issue of downloading thousands of bacterial genomes and have passed the on-boarding process to rOpenSci, I will submit the new biomartr version to CRAN.

You can download the developer version by typing:

# install the current version of biomartr on your system
source("http://bioconductor.org/biocLite.R")
biocLite("HajkD/biomartr")

Following are the problems Akshaya found:

I have no problem downloading smaller datasets, e.g. all bacterial GenBank sequences from the subgroup Cyanobacteria.

I still get timeouts on larger jobs. Sometimes the process just keeps running and stalls without any error message, and sometimes I get an error message saying:

Error in open.connection(con, open = mode) : Timeout was reached
Calls: <Anonymous> ... <Anonymous> -> curl_connection -> open -> open.connection
Execution halted

The interesting thing is that I ran the command for different downloads (different subgroup downloading) on different machines and timeout was reached at the SAME time - so there was some miscommunication with the ftp @ one moment leading to timeout @ same time…

The meta.retrieval function works and restarts the download where it originally dropped off, but I was still not able to download all files for larger subgroups, e.g. Bacteroidetes.

When assessing the problems Akshaya pointed out to me, it seems that there could be a problem on the server side when running hundreds or thousands of access queries to NCBI. The fact that the timeout is always reached at the same time suggests that the NCBI server may implement some kind of per-IP query counter. I will look more closely at the NCBI guidelines and maybe write an email to the NCBI server maintainers.

What I can do from my side is to try to stop the download whenever I don't receive server feedback anymore and then the user (after a while) can try to re-run the meta.retrieval function and start downloading from where they left off, just as you and Akshaya proposed.

I will start working on that now and will come back to you as soon as I have found a good solution.

I apologize for the inconvenience the server timeout issue might have caused when using biomartr, but I hope that the package will be useful once this issue is solved.

Many thanks and best wishes,
Hajk

NicoleGruenheit · 2017-04-21T08:10:23Z

Hi Hajk,

thanks for your fast reply. I do remember that NCBI doesn't like a lot of single BLAST searches, so I'm guessing you are right that they restrict the number of connections somehow. Maybe there is something like their bulk web access where you send a list of IDs and retrieve all of them in one go? It seems to work if I break the bigger datasets down into chunks of 100 and use a for loop just like in your meta.retrieval function.

I was also thinking that sorting the bacteria into folders by phyla would be really useful for phylogenetic studies. I talked to a collaborator yesterday and we'd like to do an analysis that gives us a result per phylum. I know this is not a trivial thing to do. There is the taxonomy database at NCBI, and it should be possible to retrieve the full lineage of an organism using its ID, but as far as I can remember the lineages are not very standardised: you'd want the same number of entries for each lineage, but there are cases where somebody has inserted a subgroup of something and it doesn't work anymore. So I'm not sure if this would be doable.

I'll definitely recommend your package to others!

Cheers,
Nicole

Dr. Nicole Gruenheit
Research Associate
Faculty of Biology, Medicine and Health
Michael Smith Building
Oxford Road, Manchester, M13 9PT
The University of Manchester
http://thethompsonlab.wordpress.com/


HajkD · 2017-04-21T13:47:55Z

Hi Nicole,

Thank you so much for your fast response :)

I just found this bit of documentation that corresponds to your great suggestion to split the genome files into chunks of e.g. 500: https://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large
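The chunking idea can be sketched in base R as follows. This only illustrates splitting a list of organisms into fixed-size batches; `chunk_vector`, `organisms` and `download_one` are hypothetical names, and the pause between chunks is a guess, not an NCBI-documented limit.

```r
# Split a vector into consecutive chunks of at most `size` elements.
chunk_vector <- function(x, size = 100) {
  stopifnot(size >= 1)
  unname(split(x, ceiling(seq_along(x) / size)))
}

# Hypothetical usage: `organisms` is placeholder data and `download_one`
# stands in for a single-genome download function.
organisms <- paste("organism", 1:250)
chunks <- chunk_vector(organisms, size = 100)
for (i in seq_along(chunks)) {
  message("Chunk ", i, ": ", length(chunks[[i]]), " organisms")
  # for (org in chunks[[i]]) download_one(org)
  # Sys.sleep(60)  # optional pause between chunks; the interval is a guess
}
```

Processing one chunk at a time also makes it easy to resume from the first unfinished chunk after a timeout.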

I also see that for constructing the queries to do so, NCBI requires the common names of organisms, e.g. chimpanzee, instead of the scientific name. I find this extremely impractical for meta-genome retrieval, especially when dealing with bacteria, viruses or archaea, which usually don't have common names.

Anyway, I will see what I can do :) Maybe I can also get in contact with some NCBI people to see if they can help me out with that one.

Concerning your request to include the NCBI Taxonomy for phyla classified retrieval of genomes, I think this is a great idea! Since most of my scientific projects have an evolutionary context anyway, I can put this functionality extension on my TODO list.

As a future outlook, I plan to write useful interfaces between biomartr and BLAST searches, comparative genomics tools (see e.g. orthologr), and phylogeny inference tools.

The idea is to combine my packages such as biomartr and orthologr for standard bioinformatics tasks, e.g. multiple sequence alignment, via the magrittr pipelining approach.

For example, multiple sequence alignments for a set of genomes can be performed by simply running:

biomartr::meta.retrieval() %>% orthologr::multi_aln() %>% ...

Or pairwise orthology inference via BLAST reciprocal best hit can then be performed by running:

biomartr::meta.retrieval() %>% orthologr::map.generator() %>% ...

Or phylogeny inference via:

biomartr::meta.retrieval() %>% phylr::tree_infer() %>% ...

Unfortunately, this is still a work in progress, but along the way I am always happy to receive input for potential improvements or functionality extensions.

Thank you so much for your help and feedback, I truly appreciate it :)

I will keep you posted about the new functionalities.

Best wishes,
Hajk

HajkD · 2023-09-27T13:48:30Z

Please have a look at our new software GenEra which may be able to help here: https://github.com/josuebarrera/GenEra .

Cheers,
Hajk

HajkD added the bug label Apr 20, 2017

HajkD added enhancement and removed bug labels Apr 21, 2017

HajkD closed this as completed Sep 27, 2023
