The MetaGraph Sequence Index dataset offers full-text searchable index files for raw sequencing data hosted in major public repositories. These include the European Nucleotide Archive (ENA) managed by the European Bioinformatics Institute (EMBL-EBI), the Sequence Read Archive (SRA) maintained by the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA). Currently, the index supports searches across more than 10 million individual samples, with this number steadily increasing as indexing efforts continue.
Following the principle of phylogenetic compression, we have hierarchically clustered all samples using information from their biological taxonomy (as far as it is available). As a result, we currently have a pool of approximately 5,000 individual index chunks, each containing the information of a subset of the samples. Every chunk is assigned to exactly one taxonomic category. Overall, there are approximately 200 taxonomic categories, each containing between a few and over 1,000 individual index chunks. The number of chunks within a category is mostly driven by the number of samples available from that taxonomic group. The chunk size is limited for practical reasons, to allow for parallel construction and querying.
Individual categories were formed by grouping phylogenetically similar samples together. This grouping started at the species level of the taxonomic tree; if too few samples were available to form a chunk, the taxonomic parent was used to aggregate samples instead. The resulting list of categories is available here.
All data is available under the following root: s3://metagraph/all_sra
s3://metagraph/all_sra
+-- data
|  +-- category_A
|  |  +-- chunk_1
|  |  +-- ...
|  +-- ...
+-- metadata
   +-- category_A
   |  +-- chunk_1
   |  +-- ...
   +-- ...
Where category_A would be one of the available categories mentioned above. Likewise, chunk_1 would be replaced with the running number of the chunk, padded with zeros to a total length of 4. As an example, to reach the data for the 10th chunk of the metagenome category, the resulting path would be s3://metagraph/all_sra/data/metagenome/0010/.
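For instance, you can inspect the contents of that chunk directly with the AWS CLI (adding --no-sign-request should work if you have no credentials configured, assuming the bucket allows anonymous access):
aws s3 ls s3://metagraph/all_sra/data/metagenome/0010/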
Irrespective of whether you are in the data or the metadata branch, each chunk contains a standardized set of files. In the data branch, one chunk contains:
annotation.clean.row_diff_brwt.annodbg
annotation.clean.row_diff_brwt.annodbg.md5
graph.primary.small.dbg
graph.primary.small.dbg.md5
Both files ending in dbg are needed for a full-text query; together they form the MetaGraph index. The files ending in md5 are checksums to verify the correct transfer of the data in case you download it.
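As a sketch, downloading and verifying one chunk's graph file could look like this (assuming the .md5 files use the standard "checksum  filename" format understood by md5sum -c):
# download the graph file and its checksum for the example chunk
aws s3 cp s3://metagraph/all_sra/data/metagenome/0010/graph.primary.small.dbg .
aws s3 cp s3://metagraph/all_sra/data/metagenome/0010/graph.primary.small.dbg.md5 .
# verify the transfer
md5sum -c graph.primary.small.dbg.md5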
In the metadata branch, one chunk contains:
metadata.tsv.gz
This is a gzip-compressed, human-readable text file containing additional information about the samples contained within each index chunk.
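For example, to download and preview the metadata of the example chunk (the exact columns depend on the file contents):
aws s3 cp s3://metagraph/all_sra/metadata/metagenome/0010/metadata.tsv.gz .
zcat metadata.tsv.gz | head -n 5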
The following steps describe how to set up a search query across all or a subset of available index files.
Please refer to the AWS documentation for installation instructions and prerequisites.
For the third step, we recommend using Single Sign-On (SSO) authentication via IAM Identity Center:
aws configure sso
You can find the SSO Start URL in your AWS access portal. Please make sure to select default when prompted for the profile name.
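SSO sessions expire after some time; you can refresh an expired session for the default profile with:
aws sso login --profile default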
Alternatively, you can set up your credentials using the following environment variables:
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_SESSION_TOKEN="..."
or by creating the plain-text file ~/.aws/credentials with the following content:
[default]
aws_access_key_id=...
aws_secret_access_key=...
aws_session_token=...
You can find specific tokens and keys in the "Access keys" section of your AWS access portal after signing in.
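To verify that your credentials are set up correctly, you can run:
aws sts get-caller-identity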
git clone https://github.com/ratschlab/metagraph-open-data.git
cd metagraph-open-data
We assume that you work in the eu-central-2 region, and that your aws authentication is configured in the default profile.
The deployment script will set up the following in your AWS account using the CloudFormation template:
- S3 bucket to store your queries and their results;
- AWS Batch environment to execute the queries;
- Step Function and Lambdas to schedule your queries as individual Batch tasks and merge their results;
- SNS topic to send notifications to when the query is fully processed.
If you want to receive Simple Notification Service (SNS) notifications after a query is processed, you have to provide your email to the script using the --email test@example.com argument. You will then need to confirm the subscription via a link sent to your mailbox:
scripts/deploy-metagraph.sh --email test@example.com
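Once the script finishes, you can check that the CloudFormation stack deployed successfully, for example by listing completed stacks:
aws cloudformation list-stacks --stack-status-filter CREATE_COMPLETE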
If you want to use your own Amazon Machine Image (AMI) for AWS Batch jobs (e.g., for security reasons or to support newer MetaGraph features), use --ami ami-... to provide your AMI ID, or request that it be built using your AWS resources via --ami build. The latter uses EC2 and may take up to 30 minutes!
scripts/upload-query.sh examples/test_query.fasta
You can upload your own queries by providing /path/to/query.fasta instead of examples/test_query.fasta. You can also upload examples/100_studies_short.fq if you would like to test the setup on a larger query.
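If you want to try an ad-hoc query of your own, a minimal sketch (with a placeholder sequence) could look like this:
# create a small FASTA file with a placeholder sequence
cat > my_query.fasta <<'EOF'
>my_sequence_1
ACGTACGTACGTACGTACGTACGTACGTACGT
EOF
scripts/upload-query.sh my_query.fasta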
You need to describe your query in a JSON file. A minimal job definition (examples/scheduler-payload.json) looks as follows:
{
"index_prefix": "all_sra",
"query_filename": "test_query.fasta",
"index_filter": ".*000[1-5]$"
}
As of now, only dataset indexes stored in s3://metagraph are supported. Generally, the arguments that you can provide are as follows:
- index_prefix, e.g. all_sra or all_sra/data/metagenome. Only chunks in the subdirectories of index_prefix will be considered for querying.
- query_filename, the filename of the query that you previously uploaded via scripts/upload-query.sh.
- index_filter (.* by default), a re-compatible regular expression to filter the paths of the chunks on which the query is to be executed. For example, .*000[1-5]$ matches chunks 0001 through 0005.
Additionally, you can specify the following parameters to be passed to the MetaGraph CLI for all queried chunks:
- query_mode (labels by default),
- num_top_labels (inf by default),
- min_kmers_fraction_label (0.7 by default),
- min_kmers_fraction_graph (0.0 by default).
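As an illustration, a hypothetical payload that queries every chunk of the metagenome category in matches mode could be written like this (and submitted with the script shown below):
cat > metagenome-query.json <<'EOF'
{
  "index_prefix": "all_sra/data/metagenome",
  "query_filename": "test_query.fasta",
  "query_mode": "matches"
}
EOF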
You can submit the query for execution with the following command:
scripts/start-metagraph-job.sh examples/scheduler-payload.json
It will create a dedicated AWS Batch job for each queried chunk, adjusting allocated memory (RAM) to the chunk size.
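While the jobs are running, you can monitor their status with the AWS CLI; the job queue name below is a placeholder for whatever the CloudFormation template created in your account:
aws batch list-jobs --job-queue <your-metagraph-job-queue> --job-status RUNNING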
You can use our example JSON payload for the large query in examples/large-query.json:
{
"index_prefix": "all_sra",
"query_filename": "100_studies_short.fq",
"index_filter": ".*001[0-9]$",
"query_mode": "matches",
"num_top_labels": "10",
"min_kmers_fraction_label": "0"
}
This will execute the following command for all chunks from 0010 to 0019:
metagraph query -i graph.primary.small.dbg \
-a annotation.clean.row_diff_brwt.annodbg \
--query-mode matches \
--num-top-labels 10 \
--min-kmers-fraction-label 0 \
--min-kmers-fraction-graph 0 \
100_studies_short.fq
Then, it will save the resulting file to S3. When all chunks are processed, a dedicated script will merge the results into a single file and send you a notification.
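Assuming the results bucket created during deployment (its name is a placeholder below), you can then list and download the merged output:
aws s3 ls s3://<your-query-results-bucket>/
aws s3 cp s3://<your-query-results-bucket>/<merged-result-file> .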