The MetaGraph Sequence Index dataset offers full-text searchable index files for raw sequencing data hosted in major public repositories. These include the European Nucleotide Archive (ENA) managed by the European Bioinformatics Institute (EMBL-EBI), the Sequence Read Archive (SRA) maintained by the National Center for Biotechnology Information (NCBI), and the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA). Currently, the index supports searches across more than 10 million individual samples, with this number steadily increasing as indexing efforts continue.
Following the principle of phylogenetic compression, we have hierarchically clustered all samples using information from their biological taxonomy (as far as it is available). As a result, we currently have a pool of approximately 5,000 individual index chunks, each containing the information of a subset of the samples. Every chunk is assigned to exactly one taxonomic category. Overall, there are approximately 200 taxonomic categories, each containing between a few and over 1,000 individual index chunks. The number of chunks within a category is mostly driven by the number of samples available from that taxonomic group. The chunk size is limited for practical reasons, to allow for parallel construction and querying.
Individual categories were formed by grouping phylogenetically similar samples together. This grouping started at the species level of the taxonomic tree; if too few samples were available to form a chunk, the taxonomic parent was used to aggregate samples instead. The resulting list of categories is available here.
All data is available under the following root: s3://metagraph/all_sra
s3://metagraph/all_sra
+-- data
|  +-- category_A
|  |  +-- chunk_1
|  |  +-- ...
|  +-- ...
+-- metadata
   +-- category_A
   |  +-- chunk_1
   |  +-- ...
   +-- ...
Where category_A would be one of the available categories mentioned above. Likewise, chunk_1 would be replaced with the running number of the chunk, padded with zeros to a total length of 4. As an example, to reach the data for the 10th chunk of the metagenome category, the resulting path would be s3://metagraph/all_sra/data/metagenome/0010/.
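For instance, you can inspect the contents of that chunk directly with the AWS CLI (adding --no-sign-request should work if you have no credentials configured, assuming the bucket allows anonymous access):
aws s3 ls s3://metagraph/all_sra/data/metagenome/0010/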
Irrespective of whether you are in the data or the metadata branch, each chunk contains a standardized set of files. In the data branch, one chunk contains:
annotation.clean.row_diff_brwt.annodbg
annotation.clean.row_diff_brwt.annodbg.md5
graph.primary.small.dbg
graph.primary.small.dbg.md5
Both files ending in dbg are needed for a full-text query; together they form the MetaGraph index. The files ending in md5 are checksums to verify the correct transfer of the data in case you download it.
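As a sketch, downloading and verifying one chunk's graph file could look like this (assuming the .md5 files use the standard "checksum  filename" format understood by md5sum -c):
# download the graph file and its checksum for the example chunk
aws s3 cp s3://metagraph/all_sra/data/metagenome/0010/graph.primary.small.dbg .
aws s3 cp s3://metagraph/all_sra/data/metagenome/0010/graph.primary.small.dbg.md5 .
# verify the transfer
md5sum -c graph.primary.small.dbg.md5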
In the metadata branch, one chunk contains:
metadata.tsv.gz
This is a gzip-compressed, human-readable text file containing additional information about the samples contained within each index chunk.
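For example, to download and preview the metadata of the example chunk (the exact columns depend on the file contents):
aws s3 cp s3://metagraph/all_sra/metadata/metagenome/0010/metadata.tsv.gz .
zcat metadata.tsv.gz | head -n 5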
The following steps describe how to set up a search query across all or a subset of available index files.
Please refer to the AWS documentation for installation instructions and prerequisites.
For the third step, we recommend using Single Sign-On (SSO) authentication via IAM Identity Center:
aws configure sso
You can find the SSO Start URL in your AWS access portal. Please make sure to select default when prompted for the profile name.
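SSO sessions expire after some time; you can refresh an expired session for the default profile with:
aws sso login --profile default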
Alternatively, you can set up your credentials using the following environment variables:
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_SESSION_TOKEN="..."
or by creating the plain-text file ~/.aws/credentials with the following content:
[default]
aws_access_key_id=...
aws_secret_access_key=...
aws_session_token=...
You can find specific tokens and keys in the "Access keys" section of your AWS access portal after signing in.
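To verify that your credentials are set up correctly, you can run:
aws sts get-caller-identity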
git clone https://github.com/ratschlab/metagraph-open-data.git
cd metagraph-open-data
We assume that you work in the eu-central-2 region, and that your aws authentication is configured in the default profile.
The deployment script will set up the following in your AWS account using the CloudFormation template:
- S3 bucket to store your queries and their results;
- AWS Batch environment to execute the queries;
- Step Function and Lambdas to schedule your queries as individual Batch tasks and merge their results;
- SNS topic to send notifications to when the query is fully processed.
If you want to receive Simple Notification Service (SNS) notifications after a query is processed, you have to provide your email to the script using the --email test@example.com argument. You will then need to confirm the subscription via a link sent to your mailbox:
scripts/deploy-metagraph.sh --email test@example.com
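Once the script finishes, you can check that the CloudFormation stack deployed successfully, for example by listing completed stacks:
aws cloudformation list-stacks --stack-status-filter CREATE_COMPLETE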
If you want to use your own Amazon Machine Image (AMI) for AWS Batch jobs (e.g., for security reasons or to support newer MetaGraph features), use --ami ami-... to provide your AMI ID, or request that it be built using your AWS resources via --ami build. The latter uses EC2 and may take up to 30 minutes!
scripts/upload-query.sh examples/test_query.fasta
You can upload your own queries by providing /path/to/query.fasta instead of examples/test_query.fasta. You can also upload examples/100_studies_short.fq if you would like to test the setup on a larger query.
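If you want to try an ad-hoc query of your own, a minimal sketch (with a placeholder sequence) could look like this:
# create a small FASTA file with a placeholder sequence
cat > my_query.fasta <<'EOF'
>my_sequence_1
ACGTACGTACGTACGTACGTACGTACGTACGT
EOF
scripts/upload-query.sh my_query.fasta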
You need to describe your query in a JSON file. A minimal job definition (examples/scheduler-payload.json) looks as follows:
{
"index_prefix": "all_sra",
"query_filename": "test_query.fasta",
"index_filter": ".*000[1-5]$"
}
As of now, only dataset indexes stored in s3://metagraph are supported. Generally, the arguments that you can provide are as follows:
- index_prefix, e.g. all_sra or all_sra/data/metagenome. Only chunks in the subdirectories of index_prefix will be considered for querying.
- query_filename, the filename of the query that you previously uploaded via scripts/upload-query.sh.
- index_filter (.* by default), a re-compatible regular expression to filter the paths of the chunks on which the query is to be executed. For example, .*000[1-5]$ matches chunks 0001 through 0005.
Additionally, you can specify the following parameters to be passed to the MetaGraph CLI for all queried chunks:
- query_mode (labels by default),
- num_top_labels (inf by default),
- min_kmers_fraction_label (0.7 by default),
- min_kmers_fraction_graph (0.0 by default).
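As an illustration, a hypothetical payload that queries every chunk of the metagenome category in matches mode could be written like this (and submitted with the script shown below):
cat > metagenome-query.json <<'EOF'
{
  "index_prefix": "all_sra/data/metagenome",
  "query_filename": "test_query.fasta",
  "query_mode": "matches"
}
EOF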
You can submit the query for execution with the following command:
scripts/start-metagraph-job.sh examples/scheduler-payload.json
It will create a dedicated AWS Batch job for each queried chunk, adjusting allocated memory (RAM) to the chunk size.
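While the jobs are running, you can monitor their status with the AWS CLI; the job queue name below is a placeholder for whatever the CloudFormation template created in your account:
aws batch list-jobs --job-queue <your-metagraph-job-queue> --job-status RUNNING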
You can use our example JSON payload for the large query in examples/large-query.json:
{
"index_prefix": "all_sra",
"query_filename": "100_studies_short.fq",
"index_filter": ".*001[0-9]$",
"query_mode": "matches",
"num_top_labels": "10",
"min_kmers_fraction_label": "0"
}
This will execute the following command for all chunks from 0010 to 0019:
metagraph query -i graph.primary.small.dbg \
-a annotation.clean.row_diff_brwt.annodbg \
--query-mode matches \
--num-top-labels 10 \
--min-kmers-fraction-label 0 \
--min-kmers-fraction-graph 0 \
100_studies_short.fq
Then, it will save the resulting file to S3. When all chunks are processed, a dedicated script will merge the results into a single file and send you a notification.
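Assuming the results bucket created during deployment (its name is a placeholder below), you can then list and download the merged output:
aws s3 ls s3://<your-query-results-bucket>/
aws s3 cp s3://<your-query-results-bucket>/<merged-result-file> .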