
Where can you automatically download metagenome data? #59

Closed
hollybik opened this issue Jan 31, 2012 · 6 comments

hollybik (Collaborator) commented Jan 31, 2012

We're looking for unassembled reads, assembled contigs, or both.

MG-RAST - no obvious FTP access.

IMG database - http://img.jgi.doe.gov/ - actually lists new metagenomes added since the last release version (unlike the other sites). It links out to the GOLD database and SRA, but it looks like you have to go through the individual dataset records...

SRA - Aspera access. It's not clear whether there is something like an RSS feed where we could get automatic notifications of new data. We also have to be careful when pulling down datasets based on terminology - people are submitting data under multiple labels, including "metagenome" and "metagenomics".

EBI - AWESOME new metagenomics portal: https://www.ebi.ac.uk/metagenomics/ You can view Gene Ontologies and GPS coordinates directly on the sample information page. It's not clear whether there is an FTP site behind this GUI; it seems like you'd have to go through NCBI's FTP site, where the data is mirrored, to get the raw reads.

@ghost ghost assigned hollybik Jan 31, 2012
koadman (Collaborator) commented Jan 31, 2012

We would want FTP, HTTP, or Aspera access. Ideally we could get a list of URLs to download automatically - see, for example, the text files like this one on the EBI site, which contain accession numbers:
http://www.ebi.ac.uk/genomes/organelle.details.txt
The FTP URL can then be programmatically constructed from each accession.

Alternatively, a manually and continuously updated list of URLs for metagenomic data is another option, though less desirable for obvious reasons.
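Constructing URLs from an accession listing could be sketched roughly like this (a minimal sketch: the tab-separated layout and the `URL_TEMPLATE` are placeholder assumptions, not the archive's actual FTP layout, which would need to be filled in):

```python
# Sketch: fetch a plain-text accession listing and build one URL per accession.
# The URL template below is a placeholder assumption, NOT EBI's real FTP layout.
import urllib.request

ACCESSION_LIST = "http://www.ebi.ac.uk/genomes/organelle.details.txt"
URL_TEMPLATE = "ftp://ftp.example.org/pub/{acc}/{acc}.fasta.gz"  # hypothetical

def fetch_listing(url=ACCESSION_LIST):
    """Download the plain-text accession listing."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def build_urls(listing_text):
    """Take the first tab-separated field of each line as the accession."""
    urls = []
    for line in listing_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        acc = line.split("\t")[0].strip()
        urls.append(URL_TEMPLATE.format(acc=acc))
    return urls
```

The URL list could then be handed to wget/curl or an Aspera client for the actual transfers.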

koadman (Collaborator) commented Jan 31, 2012

Raw reads are preferred, since we can assemble them with very stringent parameters that will prevent, or at least limit, chimerism.

hollybik (Collaborator, Author) commented Feb 1, 2012

After today's scrum discussion, I think our starting point should be the NCBI SRA with these specific search filters:
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&term=Metagenomics%5Bstudy+Type%5D+OR+Metagenome%5BStudy+Type%5D&ord=acc&page=1

This is the web address any script should pull from - but it looks like we'll have to mine data across multiple pages (e.g. make page=1 a variable within the web address that can be incremented with a counter).
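Making the page counter a variable could look something like this (a sketch only: the number of pages and any stopping condition would have to come from the actual results, and a real harvester would more likely use NCBI's E-utilities than scrape this CGI page):

```python
# Sketch: build one listing URL per results page by incrementing page=N.
BASE = ("http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi"
        "?view=studies&term=Metagenomics%5Bstudy+Type%5D"
        "+OR+Metagenome%5BStudy+Type%5D&ord=acc&page={page}")

def page_urls(n_pages):
    """URLs for page=1 .. page=n_pages of the filtered study listing."""
    return [BASE.format(page=p) for p in range(1, n_pages + 1)]
```

Each URL would then be fetched and its study accessions extracted with an HTML parser.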

gjospin (Owner) commented Feb 1, 2012

Let's try to estimate the size of what would be downloaded so we don't blow up our machines. One project has 68 GB of data while another is 3 MB. There are 1,110 available data sets.
We could always target a subset of those sets.
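Targeting a size-bounded subset could be sketched like this (names and sizes below are illustrative; real per-dataset sizes would come from the SRA metadata):

```python
# Sketch: greedily choose the smallest datasets until a download budget is spent,
# so a single 68 GB project can't blow through the whole allowance.
def select_within_budget(datasets, budget_bytes):
    """datasets: iterable of (name, size_in_bytes) pairs."""
    chosen, used = [], 0
    for name, size in sorted(datasets, key=lambda d: d[1]):
        if used + size <= budget_bytes:
            chosen.append(name)
            used += size
    return chosen
```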

koadman (Collaborator) commented Feb 2, 2012

Subsetting the data will be OK, and it's true we won't be able to keep the reads locally once processed. There is probably >10 TB of metagenome data in SRA already...

hollybik (Collaborator, Author) commented Feb 8, 2012

@gjospin is writing a script.
