Sequence Download and Setup Quick Start Guide

ktaed edited this page Jan 3, 2020 · 33 revisions

Activate the environment that was set up in Installation:

conda activate [ENV_NAME]

or

source activate [ENV_NAME]

Quick Start

The oneclick option runs the download, sequence decompression/parsing, sequence partitioning, and FM-index building steps using default options (see Explanation of Processing Steps below for more details). If '--path' is not specified, a folder named with the current date is created in the working directory. The '--thread' option specifies the total number of cores available for use by MTSv.

mtsv_setup oneclick

Examples (Possible Use Cases)

  1. Default databases with custom partitions
  2. Custom partitioning of selected database
  3. Transferring to air gapped systems
  4. Default partitions from custom flat files list

Explanation of Processing Steps

MTSv makes use of NCBI GenBank and RefSeq FASTA format sequences for its alignment-based classification algorithm. This portion of the pipeline is split into two modules: database and custom_db. The database module automates the download, decompression, and parsing of GenBank flat files from NCBI's FTP website. The custom_db module partitions sequence data based on the NCBI Taxonomy and splits it into a set of files that are used to build FM-indices. The partitioning step is recommended, and may be required, because RAM usage spikes to 15 to 30 times the FASTA file size during FM-index creation. By adjusting the chunk size, the pipeline can be run on desktops as well as high-performance computing clusters, given adequate storage space.
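The relationship between chunk size and memory can be sketched with simple arithmetic. The snippet below is only an illustration based on the 15 to 30 times figure quoted above; the function names are hypothetical and are not part of MTSv, and actual memory use may differ.

```python
def fm_index_ram_range_gb(chunk_size_gb, low_factor=15, high_factor=30):
    """Return (low, high) estimated peak RAM in GB for indexing one FASTA chunk.

    The factors come from the 15x to 30x range quoted in the text; they are
    rough guidance, not measurements.
    """
    if chunk_size_gb <= 0:
        raise ValueError("chunk size must be positive")
    return chunk_size_gb * low_factor, chunk_size_gb * high_factor


def max_chunk_size_gb(available_ram_gb, factor=30):
    """Return the largest chunk size (GB) that fits in RAM at the worst-case factor."""
    if available_ram_gb <= 0:
        raise ValueError("available RAM must be positive")
    return available_ram_gb / factor


low, high = fm_index_ram_range_gb(2)
print(f"A 2 GB chunk may need roughly {low} to {high} GB of RAM")
print(f"With 16 GB of RAM, consider chunks under {max_chunk_size_gb(16):.2f} GB")
```

Using the worst-case factor when choosing a chunk size is the conservative option; a machine with less RAM simply needs smaller chunks and therefore more index files.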

The download step can be skipped if the user wants to create a database from existing local flat files. This requires the user to create a file list of GenBank flat files (the files may be uncompressed) and to obtain a mirror of the NCBI Taxonomy from the FTP site. Please see the use case examples for details.
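A file list of local flat files could be generated along these lines. This is a sketch under assumptions: the list format is assumed here to be a plain text file with one path per line, and the file suffixes are typical of NCBI flat files rather than anything MTSv is known to require. The function name is hypothetical; check the use case examples for the format MTSv actually expects.

```python
from pathlib import Path

# Assumed suffixes for GenBank flat files, compressed or not (not confirmed by MTSv).
FLAT_FILE_SUFFIXES = (".seq", ".seq.gz", ".gbff", ".gbff.gz")


def write_flat_file_list(flat_file_dir, list_path):
    """Write the absolute path of every flat file under flat_file_dir, one per line."""
    paths = sorted(
        p.resolve()
        for p in Path(flat_file_dir).rglob("*")
        if p.is_file() and p.name.endswith(FLAT_FILE_SUFFIXES)
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The resulting file is the kind of list that would be passed with --ff_list, alongside --build_only and --taxonomy_path.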

  1. database
    • Download NCBI GenBank flat files (GenBank, RefSeq)
    • Create the FASTA datastore
  2. custom_db
    • Partition FASTA by TaxID
    • Chunk FASTA into 2GB (default) files.
    • Build FM-Indices
  • oneclick performs the above steps with default settings. User-definable options are explained below.
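The idea behind the chunking step listed above can be sketched as a greedy grouping of sequence records into size-limited files. This is not MTSv's implementation, only an illustration of the technique; the function name is hypothetical, and a record larger than the limit is assumed to get a chunk of its own.

```python
def chunk_records(record_sizes, chunk_limit):
    """Greedily group record indices into chunks whose total size stays within chunk_limit.

    record_sizes: sizes of consecutive FASTA records (any consistent unit).
    Returns a list of chunks, each a list of record indices in original order.
    """
    chunks, current, current_size = [], [], 0
    for index, size in enumerate(record_sizes):
        # Start a new chunk when adding this record would exceed the limit.
        if current and current_size + size > chunk_limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


# Five records with a limit of 10 units per chunk.
print(chunk_records([5, 4, 3, 9, 1], 10))
```

Each resulting chunk would then be indexed separately, which keeps the memory needed for any single FM-index build bounded by the chunk size.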

The default files downloaded are the latest GenBank Release and genome assemblies from GenBank/RefSeq at the Complete Genome assembly_level. Chromosome, Scaffold, and Contig assembly_level may also be specified. The files needed to build the NCBI Taxonomy are also downloaded.

Commands

General use of setup commands:

mtsv_setup [command] [arguments]

database

To use this command: mtsv_setup database [arguments]

The database command downloads the required files and builds a FASTA datastore. The available optional arguments are:

  • --path      Specify the location of the directory to be built [default: creates a folder named with the current date in the working directory]
  • --includedb      List of sequence sources to download [default contains "genbank", "Complete Genome", "Chromosome", "Scaffold"]
    • The FASTA datastores will be roughly 250GB, 90GB, 500GB, and 2TB, respectively.
  • --download_only      Perform only the download step on required files
  • --build_only      Perform only the FASTA datastore build step. Requires a path created using --download_only.
  • --thread      Specify available cores
  • --ff_list      Used with --build_only to create a local database from a file list of GenBank flat files
  • --taxonomy_path      Used with --build_only; should be a local mirror of the ftp://ftp.ncbi.nih.gov/pub/taxonomy/ directory
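Before choosing sources for --includedb, it can help to add up the approximate datastore sizes listed above. The sketch below simply encodes those rough figures (treating 2TB as 2000GB); the names are hypothetical and the numbers will drift as the NCBI databases grow.

```python
# Approximate FASTA datastore sizes in GB, taken from the figures quoted above.
APPROX_DATASTORE_GB = {
    "genbank": 250,
    "Complete Genome": 90,
    "Chromosome": 500,
    "Scaffold": 2000,  # quoted as roughly 2TB
}


def estimate_datastore_gb(sources):
    """Return the rough total FASTA datastore size in GB for the chosen sources."""
    unknown = [s for s in sources if s not in APPROX_DATASTORE_GB]
    if unknown:
        raise KeyError(f"no size estimate for: {unknown}")
    return sum(APPROX_DATASTORE_GB[s] for s in sources)


print(estimate_datastore_gb(["genbank", "Complete Genome"]), "GB")
```

This counts only the FASTA datastores; the downloaded flat files and the FM-indices built later need additional space.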

See Examples 1, 2, and 4 above.

custom_db

To use this command: mtsv_setup custom_db [arguments]

The custom_db command partitions sequences and builds FM-indices. The available optional arguments are:

  • --path      Specify the location of the directory built by the database command
  • --customdb      List of sources from which to create FM-indices [default "genbank"]
  • --rollup_rank      Associate sequences with the TaxID at the provided taxonomic rank [default "species"]
  • --chunk_size      Specify the size, in GB, of the FASTA files used to build FM-indices. The indices will be 15 to 20 times this size. [default 2]
  • --overwrite      Force rebuild of partitions
  • --thread      Specify available cores
  • --partitions      List of TaxIDs to include, minus TaxIDs to exclude, in a desired FM-index. See below for the format and the default partitions.

Common NCBI Tax ID Partitions

By default custom_db will split GenBank sequence data into 12 TaxID partitions:

  • "2,2157"      All prokaryotes (Bacteria and Archaea)
  • "10239,12884"      All Viruses (Viruses and Viroids)
  • "28384"      Other Sequences (Other and synthetic sequences)
  • "2759-33090,4751,7742"      Eukaryotes minus plants, fungi, vertebrates
  • "33090"      Plants
  • "4751"      Fungi
  • "7742-9443,9397,9913,9615"      Vertebrates minus primates, Chiroptera, Bos taurus, Canis lupus familiaris
  • "9443-9606"      Primates minus Homo sapiens
  • "9397"      Chiroptera (Bats)
  • "9913"      Bos taurus (Cow)
  • "9615"      Canis lupus familiaris (Dog)
  • "9606"      Homo sapiens

The format can be thought of as a set difference between two comma-separated lists of TaxIDs: the TaxIDs before the "-" are included, and the TaxIDs after it are excluded. For example, a partition of Bacilli (91061) without B. anthracis (1392) or B. cereus (1396) could be specified with the string "91061-1396,1392". The "-" can be left out if no exclusion is desired.
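To make the convention concrete, the sketch below parses a partition string into include and exclude sets and checks whether a sequence's lineage falls inside the partition. It is an interpretation of the format described above, not MTSv's own parser; the function names are hypothetical and the lineages in the usage example are abbreviated for illustration.

```python
def parse_partition(spec):
    """Split a partition string such as "91061-1396,1392" into (include, exclude) TaxID sets."""
    include_part, has_dash, exclude_part = spec.partition("-")
    include = {int(t) for t in include_part.split(",") if t.strip()}
    exclude = {int(t) for t in exclude_part.split(",") if t.strip()} if has_dash else set()
    if not include:
        raise ValueError(f"partition {spec!r} has no TaxIDs to include")
    return include, exclude


def in_partition(lineage_taxids, include, exclude):
    """True if the lineage contains an included TaxID and no excluded TaxID."""
    lineage = set(lineage_taxids)
    return bool(lineage & include) and not (lineage & exclude)


include, exclude = parse_partition("91061-1396,1392")

# Abbreviated lineages (illustrative only): Bacteria > Bacilli > Bacillus > species.
b_subtilis = [2, 91061, 1386, 1423]
b_anthracis = [2, 91061, 1386, 1392]

print(in_partition(b_subtilis, include, exclude))   # inside the Bacilli partition
print(in_partition(b_anthracis, include, exclude))  # excluded from the partition
```

Treating the two sides as sets mirrors the description above: order within each comma-separated list does not matter, and a string with no "-" simply yields an empty exclude set.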

For more Tax IDs, visit the NCBI Taxonomy Browser.

See Examples 1, 2, and 4 above.