Looking for unassembled reads, assembled contigs, or both
MG-RAST - no obvious FTP
IMG database - http://img.jgi.doe.gov/ - Unlike all the other sites, it actually lists the metagenomes added since the last release version. It links out to the GOLD database and SRA, but it looks like you have to go through the individual dataset records...
SRA - Aspera access. Not clear if there is something like an RSS feed where we can get automatic notifications of new data. We also have to be careful when pulling down datasets based on terminology - people are submitting data under multiple labels, including "metagenome" and "metagenomics".
EBI - AWESOME new metagenomics portal: https://www.ebi.ac.uk/metagenomics/ You can view Gene Ontologies and GPS directly on the sample information page. Not clear if there is an FTP site behind this GUI; it kind of seems like you'd have to go through NCBI's FTP site, where the data is mirrored, to get the raw reads.
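On the terminology problem, a minimal sketch of how a script might catch both labels: it matches on the shared stem "metagenom" rather than on one exact word. The function name and the choice of stem are assumptions for illustration, not something any of these sites prescribe, and the stem will also match related labels that may or may not be wanted.

```python
def is_metagenome_label(label):
    """Return True if a free-text label looks metagenome-related.

    Assumption: matching the case-insensitive stem 'metagenom' is enough
    to cover variants such as 'metagenome' and 'metagenomics'.
    """
    return "metagenom" in label.lower()


def filter_records(records):
    """Keep (accession, label) pairs whose label matches the stem."""
    return [(acc, label) for acc, label in records if is_metagenome_label(label)]
```

Any real query would still need a manual look at what the stem pulls in before downloading anything.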
We would want FTP, HTTP, or Aspera access. Ideally we could get a list of URLs to download automatically. See, for example, text files like this one on the EBI site, which contain accession numbers: http://www.ebi.ac.uk/genomes/organelle.details.txt
The FTP URL can then be constructed programmatically from the accession.
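Constructing download URLs from a list of accessions could be sketched roughly as below. This is only an illustration: the URL template is a made-up placeholder (the real FTP layout would have to be checked), and the parser assumes the accession is the first whitespace-separated field on each line, which may not match the actual file format.

```python
# Hypothetical template - NOT a real FTP layout; replace after checking the site.
URL_TEMPLATE = "ftp://ftp.example.org/data/{accession}.fastq.gz"


def parse_accessions(text):
    """Extract accessions from a listing, one record per line.

    Assumption: the accession is the first whitespace-separated field;
    blank lines and lines starting with '#' are skipped.
    """
    accessions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        accessions.append(line.split()[0])
    return accessions


def build_urls(accessions, template=URL_TEMPLATE):
    """Fill the template once per accession."""
    return [template.format(accession=acc) for acc in accessions]
```

The resulting list could be written to a text file and handed to whatever download tool we settle on.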
Alternatively, a manually and continuously updated list of URLs for metagenomic data is another option, though less desirable for obvious reasons.
This is the web address you should pull from in any script - but it looks like we'll have to mine data across multiple pages (e.g. make page=1 a variable within the web address that is incremented with a counter)
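A rough sketch of the page counter idea: the base URL below is a placeholder, since the real address and any other query parameters would need to be filled in, and the name of the `page` parameter is taken from the comment above.

```python
from urllib.parse import urlencode

# Placeholder - substitute the actual listing address.
BASE_URL = "https://example.org/projects"


def page_urls(base_url, first_page=1, last_page=3, extra_params=None):
    """Build one URL per page by incrementing the 'page' query parameter."""
    urls = []
    for page in range(first_page, last_page + 1):
        params = dict(extra_params or {})
        params["page"] = page
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls
```

In practice the loop would probably stop when a page comes back empty rather than at a fixed `last_page`, but that depends on how the site behaves past the last page.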
Let's try to estimate the size of what would be downloaded so we don't blow up our machines. One project has 68G of data while another has 3M. There are 1110 available data sets.
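One way to keep downloads within a disk budget is sketched below. It assumes sizes are reported as strings like "68G" or "3M" and treats the suffixes as binary units (1G = 1024^3 bytes), which is a guess. The selection is a simple smallest-first greedy pick, used here only for illustration.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size string such as '68G' or '3M' to bytes.

    Assumption: suffixes are binary units; a bare number is already bytes.
    """
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


def select_within_budget(sizes, budget_bytes):
    """Pick datasets smallest-first until the next one would exceed the budget."""
    chosen, total = [], 0
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
        if total + size > budget_bytes:
            break
        chosen.append(name)
        total += size
    return chosen, total
```

For example, with a 1G budget and the two projects mentioned above, only the 3M one would be selected.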
We could always target a subset of those sets.
Subsetting the data will be ok, and it's true we won't be able to keep the reads locally once processed. There are probably > 10TB of metagenome data in SRA already...