
Where can you automatically download metagenome data? #59

Closed
hollybik opened this issue Jan 31, 2012 · 6 comments

hollybik (Collaborator) commented Jan 31, 2012

We're looking for unassembled reads, assembled contigs, or both.

MG-RAST - no obvious FTP access.

IMG database - http://img.jgi.doe.gov/ - actually lists new metagenomes added since the last release version (unlike the other sites). It links out to the GOLD database and SRA, but it looks like you have to go through the individual dataset records...

SRA - Aspera access. It's not clear whether there is something like an RSS feed where we could get automatic notifications of new data. We also have to be careful when pulling down datasets based on terminology - people are submitting data under multiple labels, including "metagenome" and "metagenomics".

EBI - AWESOME new metagenomics portal: https://www.ebi.ac.uk/metagenomics/ You can view Gene Ontologies and GPS coordinates directly on the sample information page. It's not clear whether there is an FTP site behind this GUI; it seems like you'd have to go through NCBI's FTP site, where the data is mirrored, to get the raw reads.

@ghost ghost assigned hollybik Jan 31, 2012
koadman (Collaborator) commented Jan 31, 2012

We would want FTP, HTTP, or Aspera access. Ideally we could get a list of URLs to download automatically - see, for example, the text files like this one on the EBI site, which contain accession numbers:
http://www.ebi.ac.uk/genomes/organelle.details.txt
The FTP URL can then be programmatically constructed from each accession.

Alternatively, a manually and continuously updated list of URLs for metagenomic data is another option, though less desirable for obvious reasons.
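Constructing URLs from an accession listing could be sketched roughly like this (a minimal sketch: the tab-separated layout and the `URL_TEMPLATE` are placeholder assumptions, not the archive's actual FTP layout, which would need to be filled in):

```python
# Sketch: fetch a plain-text accession listing and build one URL per accession.
# The URL template below is a placeholder assumption, NOT EBI's real FTP layout.
import urllib.request

ACCESSION_LIST = "http://www.ebi.ac.uk/genomes/organelle.details.txt"
URL_TEMPLATE = "ftp://ftp.example.org/pub/{acc}/{acc}.fasta.gz"  # hypothetical

def fetch_listing(url=ACCESSION_LIST):
    """Download the plain-text accession listing."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def build_urls(listing_text):
    """Take the first tab-separated field of each line as the accession."""
    urls = []
    for line in listing_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        acc = line.split("\t")[0].strip()
        urls.append(URL_TEMPLATE.format(acc=acc))
    return urls
```

The URL list could then be handed to wget/curl or an Aspera client for the actual transfers.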

koadman (Collaborator) commented Jan 31, 2012

Raw reads are preferred, since we can assemble them with very stringent parameters that will prevent, or at least limit, chimerism.

hollybik (Collaborator, Author) commented Feb 1, 2012

After today's scrum discussion, I think our starting point should be the NCBI SRA with these specific search filters:
http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&term=Metagenomics%5Bstudy+Type%5D+OR+Metagenome%5BStudy+Type%5D&ord=acc&page=1

This is the web address any script should pull from - but it looks like we'll have to mine data across multiple pages (e.g. make page=1 a variable within the web address that can be incremented with a counter).
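Making the page counter a variable could look something like this (a sketch only: the number of pages and any stopping condition would have to come from the actual results, and a real harvester would more likely use NCBI's E-utilities than scrape this CGI page):

```python
# Sketch: build one listing URL per results page by incrementing page=N.
BASE = ("http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi"
        "?view=studies&term=Metagenomics%5Bstudy+Type%5D"
        "+OR+Metagenome%5BStudy+Type%5D&ord=acc&page={page}")

def page_urls(n_pages):
    """URLs for page=1 .. page=n_pages of the filtered study listing."""
    return [BASE.format(page=p) for p in range(1, n_pages + 1)]
```

Each URL would then be fetched and its study accessions extracted with an HTML parser.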

gjospin (Owner) commented Feb 1, 2012

Let's try to estimate the size of what would be downloaded so we don't blow up our machines. One project has 68 GB of data while another is 3 MB. There are 1,110 available data sets.
We could always target a subset of those sets.
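Targeting a size-bounded subset could be sketched like this (names and sizes below are illustrative; real per-dataset sizes would come from the SRA metadata):

```python
# Sketch: greedily choose the smallest datasets until a download budget is spent,
# so a single 68 GB project can't blow through the whole allowance.
def select_within_budget(datasets, budget_bytes):
    """datasets: iterable of (name, size_in_bytes) pairs."""
    chosen, used = [], 0
    for name, size in sorted(datasets, key=lambda d: d[1]):
        if used + size <= budget_bytes:
            chosen.append(name)
            used += size
    return chosen
```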

koadman (Collaborator) commented Feb 2, 2012

Subsetting the data will be OK, and it's true we won't be able to keep the reads locally once processed. There is probably >10 TB of metagenome data in SRA already...

hollybik (Collaborator, Author) commented Feb 8, 2012

@gjospin is writing a script.
