Sequence Download and Setup Quick Start Guide
Activate the environment that was set up in Installation:
conda activate [ENV_NAME]
or
source activate [ENV_NAME]
The oneclick option runs the download, sequence decompression and parsing, sequence partitioning, and FM-index building steps with default options (see Explanation of Processing Steps below for more details). If '--path' is not specified, a folder named with the current date is created in the working directory. The '--thread' option specifies the total number of cores available to MTSv.
mtsv_setup oneclick
- Default databases with custom partitions
- Custom partitioning of selected database
- Transferring to air gapped systems
- Default partitions from custom flat files list
MTSv uses NCBI GenBank and RefSeq FASTA-format sequences for its alignment-based classification algorithm.
This portion of the pipeline is split into two modules: database and custom_db. The database module automates the
download, decompression, and parsing of GenBank flat files from NCBI's FTP website. The custom_db module
partitions sequence data based on the NCBI Taxonomy and splits it into a set of files that are used to build
FM-indices. The partitioning step is recommended, and may be required, because RAM usage spikes to 15 to 30 times
the FASTA file size during FM-index creation. By adjusting the chunk size, the pipeline can be run on desktops as
well as high-performance computing clusters, given adequate storage space.
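As a rough planning aid, the RAM figure above can be turned into a quick estimate. The following is a minimal sketch and not part of MTSv: the function names `estimate_peak_ram_gb` and `largest_chunk_for_ram_gb` are hypothetical, and the code only applies the 15x to 30x multiplier quoted above to a chunk size given in GB. Actual usage may differ, for example if several indices are built at once.

```python
def estimate_peak_ram_gb(chunk_size_gb, low_factor=15, high_factor=30):
    """Return (low, high) estimated peak RAM in GB for building one FM-index.

    The 15x and 30x factors are the range quoted in this guide, not measured values.
    """
    if chunk_size_gb <= 0:
        raise ValueError("chunk size must be positive")
    return chunk_size_gb * low_factor, chunk_size_gb * high_factor


def largest_chunk_for_ram_gb(available_ram_gb, high_factor=30):
    """Return the largest chunk size (GB) whose worst-case estimate fits in RAM."""
    if available_ram_gb <= 0:
        raise ValueError("available RAM must be positive")
    return available_ram_gb / high_factor


low, high = estimate_peak_ram_gb(2)  # 2 GB is the default chunk size
print(f"A 2 GB chunk may need roughly {low} to {high} GB of RAM")
print(f"With 32 GB of RAM, keep chunks under about {largest_chunk_for_ram_gb(32):.2f} GB")
```

The second helper uses the upper end of the range on purpose, so the suggested chunk size errs on the conservative side.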
The download step can be skipped if the user wants to build a database from existing local flat files. This requires the user to create a file list of GenBank flat files (these can be unzipped) and to obtain a mirror of the NCBI Taxonomy from the FTP site. Please see the use case examples for details.
- database
- Download NCBI GenBank flat files (GenBank, RefSeq)
- FASTA datastore creation
- custom_db
- Partition FASTA by TaxID
- Chunk FASTA into 2GB (default) files.
- Build FM-Indices
- oneclick performs the above steps with default settings. User-definable options are explained below.
The default files downloaded are the latest GenBank Release and genome assemblies from GenBank/RefSeq at the Complete Genome assembly_level. Chromosome, Scaffold, and Contig assembly_level may also be specified. The files needed to build the NCBI Taxonomy are also downloaded.
General use of setup commands:
mtsv_setup [command] [arguments]
To use the database command:
mtsv_setup database [arguments]
The database command downloads the required files and builds a FASTA datastore. The available optional arguments are:
- --path Specify the location of the directory to be built [default: creates a folder named with the current date in the working directory]
- --includedb list of sequence sources to download [default contains "genbank", "Complete Genome", "Chromosome", "Scaffold"]
- The FASTA datastores will be roughly 250GB, 90GB, 500GB, 2TB, respectively.
- --download_only Perform only the download step on required files
- --build_only Perform only the FASTA datastore build step. Requires a path created using download_only.
- --thread Specify available cores
- --ff_list Used with build_only to create a local database from a file list of GenBank Flat files
- --taxonomy_path Used with build_only and should be a local mirror of ftp://ftp.ncbi.nih.gov/pub/taxonomy/ directory
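Because the datastores are large, it can help to total the expected disk usage before choosing sources. The sketch below is illustrative only and not part of MTSv: `APPROX_SIZE_GB` and `estimate_datastore_gb` are hypothetical names, the sizes are the approximate figures listed above matched in order to the `--includedb` defaults, and 2TB is approximated as 2000 GB. No estimate is given above for Contig, so it is not included.

```python
# Approximate FASTA datastore sizes in GB, copied from the option list above.
# Assumption: the sizes map to the sources in the order they are listed.
APPROX_SIZE_GB = {
    "genbank": 250,
    "Complete Genome": 90,
    "Chromosome": 500,
    "Scaffold": 2000,  # 2TB approximated as 2000 GB
}


def estimate_datastore_gb(sources):
    """Sum the approximate datastore size in GB for the chosen sources."""
    unknown = [s for s in sources if s not in APPROX_SIZE_GB]
    if unknown:
        raise KeyError(f"no size estimate available for: {unknown}")
    return sum(APPROX_SIZE_GB[s] for s in sources)


chosen = ["genbank", "Complete Genome"]
print(f"{chosen}: roughly {estimate_datastore_gb(chosen)} GB of FASTA data")
```

These totals cover the FASTA datastore only; the FM-indices built later need additional space.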
To use the custom_db command:
mtsv_setup custom_db [arguments]
The custom_db command partitions sequences. The available optional arguments are:
- --path Specify the location of the directory built by the database command
- --customdb List of sources from which to create FM-indices [default "genbank"]
- --rollup_rank Associate sequences with taxid at provided taxonomic rank [default "species"]
- --chunk_size Specify the GB size of FASTA files for building FM-indices. The indices will be 15 to 20 times this size. [default 2]
- --overwrite Force rebuild of partitions
- --thread Specify available cores
- --partitions List of TaxIDs to include, minus TaxIDs to exclude, in a desired FM-index. See below for the formatting and an explanation of the defaults.
By default custom_db will split GenBank sequence data into 12 TaxID partitions:
- "2,2157" All Prokaryotes (Bacteria and Archaea)
- "10239,12884" All Viruses (Viruses and Viroids)
- "28384" Other Sequences (Other and synthetic sequences)
- "2759-33090,4751,7742" Eukaryotes minus plants, fungi, vertebrates
- "33090" Plants
- "4751" Fungi
- "7742-9443,9397,9913,9615" Vertebrates minus Primates, Chiroptera, Bos taurus, Canis lupus familiaris
- "9443-9606" Primates minus Homo sapiens
- "9397" Chiroptera (Bats)
- "9913" Bos taurus (Cow)
- "9615" Canis lupus familiaris (Dog)
- "9606" Homo sapiens
The formatting convention can be thought of as a set operation between two comma-separated lists of TaxIDs: those to include and those to exclude. For example, a partition of Bacilli (91061) without B. anthracis (1392) or B. cereus (1396) could be specified with the string "91061-1396,1392". The "-" denotes the set difference operation and can be left out if no exclusion is desired.
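To make the convention concrete, here is a minimal sketch of how such a string could be interpreted. It is not MTSv's own parser: `parse_partition` and `in_partition` are hypothetical helpers, and the sketch assumes that including or excluding a TaxID applies to everything beneath it in the taxonomy. The lineages in the example are abbreviated for illustration.

```python
def parse_partition(spec):
    """Split a partition string into (include, exclude) sets of TaxIDs."""
    # Everything before the first "-" is included, everything after is excluded.
    include_part, _, exclude_part = spec.partition("-")
    include = {int(t) for t in include_part.split(",") if t.strip()}
    exclude = {int(t) for t in exclude_part.split(",") if t.strip()}
    if not include:
        raise ValueError(f"no TaxIDs to include in {spec!r}")
    return include, exclude


def in_partition(lineage, include, exclude):
    """Check a sequence against a partition.

    lineage: TaxIDs from the sequence's own TaxID up to the root.
    Assumption: a sequence belongs if any ancestor is included and none is excluded.
    """
    lineage = set(lineage)
    return bool(lineage & include) and not (lineage & exclude)


include, exclude = parse_partition("91061-1396,1392")
print(sorted(include), sorted(exclude))

# Abbreviated, illustrative lineages (own TaxID first, then selected ancestors).
b_anthracis = [1392, 91061, 2]
b_subtilis = [1423, 91061, 2]
print(in_partition(b_anthracis, include, exclude))
print(in_partition(b_subtilis, include, exclude))
```

A string without "-", such as "2,2157", yields an empty exclusion set, which matches the rule that the "-" can be omitted when nothing is excluded.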
For more Tax IDs, visit the NCBI Taxonomy Browser.