KWAGE: Kmer Warehousing Approach for Genomic Exploration (and also the name of a local trail in Los Alamos, New Mexico)

A C++ implementation of the ultra-fast, Bloom filter-based sequence search originally developed by the Iqbal group. This project is similar in spirit to the Iqbal group's BIGSI follow-on project: COBS: a Compact Bit-Sliced Signature Index. While the goals are very similar, the KWAGE project is an independent implementation of the original BIGSI algorithm with the addition of the following features:

  1. Parallel, C++ implementation (using both OpenMP and MPI)
  2. Adaptive Bloom filter size (similar to COBS)
  3. Compressed database files
  4. Bloom filter construction directly from Sequence Read Archive (SRA) files

The KWAGE pipeline has several components:

  1. sra_inventory
    • Parse and convert the NCBI-supplied SRA metadata into a single binary file. This file is read by the maestro program to add metadata and annotation information to the final database files.
  2. maestro
    • Streaming Bloom filter construction from SRA accessions.
    • Aggregation and transposition of individual Bloom filters into database files.
  3. kwage
    • Searching the resulting database with a nucleic acid query.

The following helper applications are also provided:

  1. dump_db, a tool for dumping the header, annotation, and first few bit slices of a database file.
  2. SriRachA, a tool for per-read, kmer-based searching of SRA records against a set of user-provided query sequences. This tool is useful for confirming and investigating the per-accession matches reported by KWAGE. This tool is contained in the SriRachA/ subdirectory (with a separate Makefile and documentation).
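
Putting the pieces together, a typical end-to-end run looks like the following sketch. All file, directory, and bucket names below are hypothetical placeholders, and dump_db is assumed (not confirmed by this document) to take the database file as its argument:

# 1. Convert the NCBI XML metadata tar file into a binary inventory
./sra_inventory -i NCBI_SRA_Metadata_Full.tar.gz -o sra_meta.bin

# 2. Download SRA records, build Bloom filters and pack database files
#    (--s3.no-write keeps the database files on local scratch)
mpirun -np 16 ./maestro --scratch /scratch/kwage --meta sra_meta.bin \
	--s3 s3://my-kwage-bucket/db --s3.no-write

# 3. Search the resulting database files with a nucleic acid query
./kwage -d /scratch/kwage ATGCGTACGTTAGCATTACCGGTACGTACGGT

# 4. (Optional) Inspect a database file's header and annotation
./dump_db /scratch/kwage/kwage_0001.db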

Building and installing the code

The KWAGE code is written in C++ and requires MPI (the Message Passing Interface; the code has been tested with OpenMPI). A compiler that supports OpenMP is also recommended, but not required.

Currently, downloading the SRA records is orchestrated by the maestro program. While previous versions relied on a cluster scheduling system (like Slurm), this dependency has been removed. As discussed below, downloading the entire SRA to a private network (i.e. non-AWS, non-Google) is not practical -- the several petabytes of SRA data would simply take too long to transfer.

During the database creation process, database files are written to an AWS S3 bucket (to avoid exhausting expensive local storage). The AWS command line tools are used to copy database files to S3. This storage approach will need to be modified to run on a different cloud provider (e.g. Google). The dependency on the AWS command line tools can be avoided by specifying the --s3.no-write flag to the maestro program. Depending on the number of SRA records being indexed, the storage requirements for the database files can be significant!
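
For reference, the copy that maestro performs with the AWS command line tools is conceptually equivalent to running the following by hand (the bucket and file names are hypothetical):

# Hypothetical example: copy a finished database file to an S3 bucket
aws s3 cp kwage_0001.db s3://my-kwage-bucket/db/kwage_0001.db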

Step 1: Download the SRA toolkit and associated APIs

The SRA toolkit and associated APIs are required if you would like to download and/or construct Bloom filters from SRA files. In particular, you will need to download and build the following three GitHub projects from NCBI:

  1. ncbi-vdb
  2. ngs
  3. sra-tools

Please note that these NCBI libraries are only needed if you will either download SRA files or create Bloom filters from SRA files. If you only need to build a database from existing Bloom filters or to search an existing database, then these NCBI libraries are not needed.

How to actually build the SRA toolkit packages

It can be a little tricky to compile the ncbi-vdb, ngs and sra-tools packages. Here is a step-by-step guide (with specific package install commands for a Red Hat-like OS that uses the yum package manager):

  1. Install the git repository control software (e.g. sudo yum install git).
  2. Install a C/C++ compiler (e.g. sudo yum group install "Development Tools").
  3. Install a Java development kit (while Java is not used in KWAGE, a JDK is needed to compile the ngs package).
  4. Install libxml2 development libraries (e.g. sudo yum install libxml2-devel.x86_64). Many Linux distributions ship with libxml2 but not the development headers, which will be needed.
  5. Create a directory to hold all of the NCBI SRA software (e.g. mkdir SRA). The rest of these instructions assume that the path to the SRA packages is $HOME/SRA.
  6. Download sra-tools from GitHub (e.g. git clone https://github.com/ncbi/sra-tools.git).
  7. Download ngs from GitHub (e.g. git clone https://github.com/ncbi/ngs.git).
  8. Download ncbi-vdb from GitHub (e.g. git clone https://github.com/ncbi/ncbi-vdb.git).
  9. Compile the NCBI packages in the following order (order is important!):
# Build and install ncbi-vdb
cd $HOME/SRA/ncbi-vdb
./configure --prefix=$HOME/SRA \
	--with-ngs-sdk-prefix=$HOME/SRA/ngs

# Before building on Amazon's AWS (and depending on choice of image/instance), you may need to remove the `-Wa,-march=generic64+sse4` flags from `ncbi-vdb/libs/krypto/Makefile` prior to running `make`
make
make install
# Build and install ngs/ngs-sdk (a sub-package of ngs)
cd $HOME/SRA/ngs/ngs-sdk
./configure --prefix=$HOME/SRA/ngs/ngs-sdk
make
make install
# Build and install ngs/ngs-java
cd $HOME/SRA/ngs/ngs-java
./configure --prefix=$HOME/SRA
make
make install
# Build and install ngs
cd $HOME/SRA/ngs
./configure --prefix=$HOME/SRA \
	--with-ncbi-vdb-prefix=$HOME/SRA/ncbi-vdb \
	--with-ngs-sdk-prefix=$HOME/SRA/ngs/ngs-sdk
make
make install
# Build and install sra-tools
cd $HOME/SRA/sra-tools
./configure --prefix=$HOME/SRA \
	--with-ngs-sdk-prefix=$HOME/SRA/ngs/ngs-sdk \
	--with-ncbi-vdb-sources=$HOME/SRA/ncbi-vdb

make
make install
  10. Run the vdb configuration tool: $HOME/SRA/bin/vdb-config --interactive.
    • Select the [X] report cloud instance identity option
    • At this time, do not select the [ ] accept charges for AWS.
    • Select Save
    • Select Exit

The options you select here will depend on your local compute environment (e.g. cloud vs local).

Step 2: Edit the KWAGE Makefile

After the three separate NCBI software packages have been downloaded and compiled, you will need to edit the KWAGE Makefile to specify the locations of both the include and library directories for the ncbi-vdb and ngs packages.

Edit the SRA_LIB_PATH variable to specify the directory on your system that contains the SRA library files.

Edit the SRA_INCLUDE_PATH variable to specify the directory on your system that contains the SRA include files.

Edit the NCBI_VDB_INCLUDE_PATH variable to specify the directory on your system that contains the NCBI-VDB include files.
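
For example, if the NCBI packages were installed under $HOME/SRA as described in Step 1, the Makefile variables might look like the following sketch. The exact subdirectory names (lib64 vs. lib, etc.) depend on your system and install, so treat these values as assumptions to verify:

# Hypothetical values, assuming the $HOME/SRA install prefix used above
SRA_LIB_PATH = $(HOME)/SRA/lib64
SRA_INCLUDE_PATH = $(HOME)/SRA/include
NCBI_VDB_INCLUDE_PATH = $(HOME)/SRA/ncbi-vdb/interfaces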

Step 3: Run make -j

Running make in the KWAGE directory will build all pipeline components. There is no separate installation step required. If you encounter any problems, please email jgans@lanl.gov for assistance.

Format the NCBI SRA metadata

The sra_inventory program is responsible for formatting the NCBI SRA metadata for inclusion in the final database files. The NCBI provides a compressed tar file of XML metadata on their FTP site. Do not uncompress or untar this file! Since the uncompressed/untarred files are large and cumbersome to manage, the sra_inventory program directly reads the SRA metadata (as a compressed tar file) and converts it into a single binary file that can be read by the maestro program that builds the database files from SRA records. The command line arguments for the sra_inventory program are:

Usage for SRA inventory:
	-i <XML metadata file from ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata>
	[-o <binary output file>]
	[--list (list, but do not write binary SRA inventory)]
	[--date.from <YYYY-MM-DD>] (only download SRA records received after this date)
	[--date.to <YYYY-MM-DD>] (only download SRA records received before this date)
	[--strategy <strategy key word>] (only download SRA records that match one of the specified experimental strategies)
		Examples include: RNA-Seq, WGS, AMPLICON, Bisulfite-Seq, ... (case sensitive!)
	[--source <source key word>] (only download SRA records that match one of the specified experimental sources)
		Examples include: TRANSCRIPTOMIC, GENOMIC, METAGENOMIC, METATRANSCRIPTOMIC, ... (case sensitive!)
	[--include <list of SRA run accessions>] (only download SRA records that match one of the specified SRA runs)
  • The argument to the -i flag is the compressed tar file of XML metadata downloaded from the NCBI FTP site.
  • The argument to the -o flag is the name of the binary output file (it is required, unless the --list flag is also specified).
  • The --list flag lists the SRA records found in the XML metadata file (and suppresses the creation of the binary output file).
  • The argument to the --date.from flag restricts the download to SRA runs that were received after this date.
  • The argument to the --date.to flag restricts the download to SRA runs that were received before this date.
  • The argument to the --strategy flag restricts the download to SRA runs that have a matching experimental strategy.
    • This command can be repeated multiple times.
    • The string matching is case sensitive.
    • A partial list of experimental strategies includes:
      • RNA-Seq
      • WGS
      • AMPLICON
      • Bisulfite-Seq
  • The argument to the --source flag restricts the download to SRA runs that have a matching experimental source.
    • This command can be repeated multiple times.
    • The string matching is case sensitive.
    • A partial list of experimental sources includes:
      • TRANSCRIPTOMIC
      • GENOMIC
      • METAGENOMIC
      • METATRANSCRIPTOMIC
  • The argument to the --include flag restricts the download to the SRA accessions contained within the specified file. The SRA accessions must be listed in an ASCII text file, with one accession per line.
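
As a concrete sketch, the following hypothetical invocation (the input tar file and output file names are placeholders) builds a binary inventory restricted to WGS runs from genomic or metagenomic sources that were received during 2021:

# Hypothetical example: inventory WGS runs from (meta)genomic sources received in 2021
./sra_inventory -i NCBI_SRA_Metadata_Full.tar.gz -o sra_meta.bin \
	--strategy WGS --source GENOMIC --source METAGENOMIC \
	--date.from 2021-01-01 --date.to 2021-12-31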

SRA annotation information

Currently, SRA annotation information is extracted from the XML metadata downloaded from the NCBI and included in the Bloom filter files and the final KWAGE database. Unless noted, information is stored as a string. Since not all fields are provided for every SRA record, the string "NA" indicates a missing field.

  • Run
    • Accession
  • Experiment
    • Accession
    • Title
    • Design description
    • Library name
    • Library strategy
    • Library source
    • Library selection
    • Instrument model
  • Sample
    • Accession
    • Taxa
    • Attributes (as a collection of key/value pairs)
  • Study
    • Accession
    • Title
    • Abstract

SRA download and database construction

The maestro program is responsible for the streaming download of SRA records (i.e. downloading SRA files → creating Bloom filters → removing SRA files), packing database files with multiple Bloom filters, and copying the database files to long-term (e.g. AWS S3) storage. The command line arguments for the maestro program are:

Usage for Maestro:
	--scratch <parent directory for staging Bloom filter and database files>
	[--scratch.bloom <scratch directory for staging Bloom filter>]
	[--scratch.database <scratch directory for staging database files>]
	--meta <binary metadata input file>
	--s3 <S3 path for storing database files>
	[--s3.no-write (do *not* write database files to s3)]
	[--stream (stream SRA data -- do not use prefetch to download!)]
	[--max-sra-download <max allowed SRA file size in GB>] (default is 30 GB)
	[--status <binary SRA status file for restart>] (default is ./__sra_db_status.bin)
	[--retry <number of download attempts>] (default is 3)
	[--retry.bloom (retry all failed Bloom filters)]
	[--delay <minimum number of seconds between download/streaming requests>]
	[--halt-after <halt after this many SRA downloads> (default is not to stop)]
	[-k <kmer length>] (default is 31)
	[-p <false positive probability (per k-mer, per-filter)>] (default is 0.25)
	[--min-kmer-count <minimum allowed k-mer count>] (default is 5)
	[--hash <hash function name>] (default is murmur32)
		Allowed hash functions: murmur32
	[--len.min <log2 Bloom filter len>] (default is 18)
	[--len.max <log2 Bloom filter len>] (default is 32)
	[-v (turn on verbose output)]
	[--save.bloom (don't remove Bloom filters after database construction)]
	[--save.db (don't remove database file after S3 upload)]
	[--save.sra (don't remove SRA files after Bloom filter construction)]
	[--skip <SRA run accession> (skip over the specified accession; may be repeated)]
  • The argument to the --scratch flag specifies the parent directory for staging intermediate files (e.g. SRA files, Bloom filter files and database files). How much scratch space is needed? Good question! The amount of space is difficult to estimate, as it depends on the number of MPI ranks provided to maestro. As a rule of thumb, we recommend providing approximately 2-3 TB of scratch space when running with 150-200 MPI ranks. The scratch space must be available to all MPI ranks and, when running on AWS, the FSx for Lustre filesystem appears to work well.
  • The argument to the optional --scratch.bloom flag specifies the directory for staging intermediate Bloom filter files.
  • The argument to the optional --scratch.database flag specifies the directory for staging intermediate database files.
  • The argument to the --meta flag is the binary file of metadata created by the sra_inventory program described above. This file defines the SRA accessions that will be downloaded, converted into Bloom filters and stored in database files.
  • The argument to the --s3 flag is the S3 path where the final database files will be stored. Database files are copied to S3 to save expensive local storage and to facilitate the moving of database files to other systems.
  • The optional --s3.no-write flag suppresses the writing of database files to S3. This avoids the dependency on the AWS command line tools. However, when indexing a large number of SRA records, the space required for storing the database files can be significant, so be prepared!
  • The optional --stream flag uses the NCBI SRA API to stream SRA records (instead of using the NCBI prefetch command). While these two approaches are quite similar (both require downloading the "helper" files that are associated with some of the individual SRA records), streaming appears to be slightly more reliable when running on AWS and writing to the FSx/Lustre filesystem. We have observed cases where the prefetch tool appears to stall when writing an .sra file to FSx/Lustre.
  • The argument to the optional --max-sra-download flag specifies the maximum size of downloaded SRA records in GB (by passing this argument to the prefetch command). Limiting the size of SRA downloads is not currently implemented when streaming SRA records using the --stream flag.
  • The argument to the optional --status flag specifies the binary file (by default ./__sra_db_status.bin) that is used to store the current state of the maestro program. This file is essential for restarting the maestro program after a crash or system failure.
  • The argument to the optional --retry flag specifies the number of SRA download attempts before giving up on a specific SRA record. The default number of attempts is 3.
  • The optional --retry.bloom flag forces the program to retry all of the SRA records that previously failed to generate a valid Bloom filter.
  • The argument to the optional --delay flag specifies the number of seconds to wait between SRA download or streaming requests. By default, there is no delay.
  • The argument to the optional --halt-after flag specifies the number of SRA records to index (download and convert to a Bloom filter) before halting. This is useful for debugging. By default, the program halts after all requested SRA records are indexed.
  • The argument to the optional -k flag is the integer k-mer length. The default k-mer length is 31 and the maximum allowed k-mer length is 32.
  • The argument to the optional -p flag is the desired Bloom filter false positive rate (per k-mer and per-filter). The default value is 0.25. This rate must be between 0 and 1, and is used to determine the Bloom filter parameters (i.e. the number of bits and the number of hash functions used per k-mer). SRA files that have more k-mers than can be accommodated at the desired false positive rate (for the specified limits on Bloom filter length and number of hash functions) are skipped.
  • The argument to the optional --min-kmer-count flag is the minimum number of times a given k-mer must be found in an SRA file. The default value is 5. The purpose of this threshold is to "de-noise" (i.e. remove sequencing errors from) SRA files. Sequencing errors can generate large numbers of unique, but not biologically relevant k-mers. Larger minimum k-mer counts significantly reduce the size of the Bloom filter needed to represent an SRA file, but also introduce the possibility of false negative search results.
  • The argument to the optional --hash flag is the name of the hash function to use when constructing the Bloom filter. Currently, only the 32-bit murmur hash function is supported. In the future, additional hash functions may be added (e.g. 64-bit murmur hash).
  • The argument to the optional --len.min flag is the log2(minimum Bloom filter length in bits) to test when computing the optimal Bloom filter length. The default value is 18.
  • The argument to the optional --len.max flag is the log2(maximum Bloom filter length in bits) to test when computing the optimal Bloom filter length. The default value is 32.
  • The optional flag -v turns on verbose output.
  • The optional flag --save.bloom prevents the deletion of the intermediate Bloom filter files after database construction. Useful for debugging.
  • The optional flag --save.db prevents the deletion of the intermediate database files after uploading database files to AWS S3 storage. Useful for debugging.
  • The optional flag --save.sra prevents the deletion of the intermediate .sra files (and associated helper files) after Bloom filter construction. Useful for debugging and only applies when the --stream option is not specified.
  • The argument to the optional --skip flag is an ASCII text file of SRA accessions to skip. The file must contain one SRA accession per line.
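
As a concrete sketch, the following hypothetical invocation (the paths, MPI rank count and bucket name are placeholders) indexes the first 100 SRA records with verbose output and without writing to S3:

# Hypothetical example: index 100 SRA records, keeping database files on local scratch
mpirun -np 16 ./maestro \
	--scratch /scratch/kwage \
	--meta sra_meta.bin \
	--s3 s3://my-kwage-bucket/db --s3.no-write \
	--halt-after 100 -k 31 --min-kmer-count 5 -v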

Searching the database

The kwage program enables searching the database files with one or more user-supplied DNA sequences. Currently, all of the database files must be stored on a POSIX filesystem (i.e. not on AWS S3).

The command line arguments are:

Usage for kwage:
	[-o <output file>] (default is stdout)
	[--o.csv (output CSV) | --o.json (output JSON)]
	[-t <search threshold>] (default is 1)
	-d <database search path> (can be repeated)
	[-i <input sequence file>] (can be repeated)
	[<DNA sequence>] (can be repeated)
  • The argument to the optional -o flag specifies the name of the output file of search results. By default, search results are written to stdout to enable piping to other programs.
    • The output format of the search results can be either comma separated values (--o.csv) or JSON (--o.json). The default output format is JSON.
  • The argument to the optional -t flag specifies the search threshold in terms of the minimum fraction of query sequence kmers that must match an individual database record. Valid threshold values must be greater than 0 and less than, or equal to, 1. The default search threshold is 1 (i.e. all query sequence kmers must be found in the same database record to report a match).
  • The argument to the -d flag specifies the parent directory for all database files. The parent directory and all subdirectories are recursively searched for database files (i.e. .db or .dbz).
  • The argument to the optional -i flag is the name of a fasta or fastq file of query sequences that will be searched against all of the specified database files.
  • Multiple DNA sequences (delimited by white space) can be directly specified on the command line.
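
For example (file and directory names are hypothetical):

# Hypothetical example: search with a fasta file of queries, requiring at least
# 80% of query kmers to match, and write the results as CSV
./kwage -d /scratch/kwage/db -i queries.fasta -t 0.8 --o.csv -o results.csv

# Hypothetical example: search with a DNA sequence given directly on the command line
./kwage -d /scratch/kwage/db ATGCGTACGTTAGCATTACCGGTACGTACGGT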
