Accurate read-based metagenome characterization using a hierarchical suite of unique signatures. Please visit our homepage:

Genomic Origin Through Taxonomic CHAllenge (GOTTCHA)

GOTTCHA is an application of a novel, gene-independent, signature-based metagenomic taxonomic profiling method that achieves significantly lower false discovery rates (FDR) and is laptop deployable. Our algorithm was tested and validated on twenty synthetic and mock datasets ranging in community composition and complexity, was applied successfully to data generated from spiked environmental and clinical samples, and robustly demonstrates superior performance compared with other available tools.


A Linux (2.6 kernel or later) or Mac (OS X 10.6 Snow Leopard or later) operating system with at least 8 GB of RAM is recommended. Perl v5.8 or above is required. A C/C++ compiling environment might be required for installing dependencies. Systems may vary, so please ensure that the essential software building packages (e.g. build-essential for Ubuntu, Xcode for Mac) are installed properly before running the installation script.
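
As a quick pre-flight check for the Perl requirement above, the following is a minimal sketch and not part of the GOTTCHA package. The helper names `version_ge` and `check_perl` are hypothetical, and the script only assumes that `perl` and `awk` can be run from a POSIX shell.

```shell
# Succeeds when dotted version $1 is greater than or equal to dotted version $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing components count as 0, so "5.8" equals "5.8.0".
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Report whether the Perl found on PATH meets the v5.8 minimum.
check_perl() {
    command -v perl >/dev/null 2>&1 || { echo "perl not found"; return 1; }
    found=$(perl -e 'printf "%vd", $^V')
    if version_ge "$found" 5.8; then
        echo "perl $found: OK"
    else
        echo "perl $found: older than 5.8"
        return 1
    fi
}
```

Calling `check_perl` prints the detected version and returns a non-zero status when Perl is missing or too old.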

GOTTCHA was tested successfully on our Linux servers (Ubuntu 12.10 w/ Perl v5.14.2; Ubuntu 10.04 w/ Perl v5.10.1) and MacBook Pro laptops (Mac OS X 10.8 w/ Xcode v5.1).


This is an example of profiling a "test.fastq" file using GOTTCHA with a species-level pre-computed bacterial database. The testing FASTQ file comes along with the GOTTCHA package in the "test" directory. More details are stated in the INSTRUCTION section.

  1. Obtaining GOTTCHA package:

     $ git clone gottcha
  2. Installing GOTTCHA:

     $ cd gottcha
     $ ./
  3. Downloading lookup table and species-level database from our web server:

     $ wget
     $ wget

    If you have any difficulty obtaining the databases, please contact Po-E Li

  4. Unpacking and decompressing archives:

     $ tar -zxvf GOTTCHA_lookup.tar.gz
     $ tar -zxvf GOTTCHA_BACTERIA_c4937_k24_u30_xHUMAN3x.species.tar.gz
  5. Running

     $ bin/             \
          --threads 8             \
          --outdir ./             \
          --input test/test.fastq \
          --database database/GOTTCHA_BACTERIA_c4937_k24_u30_xHUMAN3x.species
  6. Enjoying the result at './test.gottcha.tsv'.


The steps in the section above are described in detail in this section. Note that all instructions in this document use pre-computed databases downloaded from our web site.

If you are looking for instructions to build a CUSTOM database and/or running GOTTCHA step-by-step, please read

Obtaining GOTTCHA

The source code can be downloaded from here. The pre-computed databases need to be downloaded separately from our web server. See the [Obtaining Pre-computed Databases] section below.

You can use "git" to obtain the package:

    $ git clone

or download the compressed archive in zip


The GOTTCHA profiling and database-generating scripts are primarily Perl-based and require at least Perl v5.8 with the dependencies installed properly (listed in ). The splitrim tool is written in D and requires an appropriate D compiler to compile it. GOTTCHA uses BWA with the BWA-MEM algorithm for read mapping. You can either keep "dmd" and "bwa" in your system path or simply run the installation script, which checks for and tries to install missing tools and dependencies:
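
If you prefer to manage the tools yourself, a small check like the one below can show which of them are not yet on your path. This is only a sketch: `missing_tools` is a hypothetical helper, not something shipped with GOTTCHA, and it does not check tool versions.

```shell
# Print the name of every listed tool that is not on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the three tools named in this section.
missing=$(missing_tools perl dmd bwa)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "perl, dmd and bwa are all on PATH"
fi
```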

	$ ./

After running successfully, the binaries and related scripts will be stored in the ./bin directory.

Obtaining Pre-computed Databases

Databases of unique genome segments at multiple taxonomic levels (e.g. family, genus, species, strain-level, etc.) are used for taxonomic classification of reads. Variants of these databases, in which all human 24-mers were removed, were also generated and used in this study. These 24-mers were derived from the GRCh37.p10 (Genome Reference Consortium), HuRef (J. Craig Venter Institute), and CHM1_1.0 (Washington U. School of Medicine) assemblies and include unplaced scaffolds. For example, GOTTCHA_BACTERIA_c3514_k24_u24_xHUMAN3x.species.tar.gz is a GOTTCHA bacterial species-level signature database that was produced by eliminating shared 24-mer (k24) sequences from 3514 bacterial replicons (c3514; includes both chromosomes and plasmids) and 3 human genomes (xHUMAN3x), while retaining a minimum of 24 bp of unique fragments (u24).
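
The naming scheme in the example above can be unpacked mechanically. The sketch below splits a database name into the parts described in this section. `describe_db` and its output labels are hypothetical, and the function assumes the name follows exactly the pattern of the example (six underscore-separated parts followed by the taxonomic level). It relies on POSIX sh or bash word splitting.

```shell
# Split a database name such as
#   GOTTCHA_BACTERIA_c3514_k24_u24_xHUMAN3x.species.tar.gz
# into labelled parts, following the naming example in the text.
describe_db() {
    name=$(basename "$1")
    name=${name%.tar.gz}          # drop the archive suffix if present
    level=${name##*.}             # text after the last dot, e.g. "species"
    stem=${name%.*}               # everything before the taxonomic level
    old_ifs=$IFS
    IFS=_
    set -- $stem                  # split the stem on underscores
    IFS=$old_ifs
    echo "group=$2"
    echo "replicons=${3#c}"       # c3514 -> 3514 replicons
    echo "kmer=${4#k}"            # k24   -> 24-mers
    echo "min_unique_bp=${5#u}"   # u24   -> at least 24 bp unique fragments
    echo "excluded=${6#x}"        # xHUMAN3x -> 3 human genomes removed
    echo "level=$level"
}
```

For instance, `describe_db GOTTCHA_BACTERIA_c3514_k24_u24_xHUMAN3x.species.tar.gz` reports `replicons=3514`, `kmer=24`, `min_unique_bp=24` and `level=species`, among the other fields.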

The compressed database archives are available for users to download from our ftp server:

GOTTCHA requires a taxonomic lookup table (GOTTCHA_lookup.tar.gz) and a pre-computed database (e.g. GOTTCHA_BACTERIA_c3514_k24_u24_xHUMAN3x.species.tar.gz) to classify reads. These signature databases can be very large, so we highly recommend that users also download the corresponding *.md5 file for verification.
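
Once an archive and its *.md5 file have been downloaded, the digest can be compared before unpacking. The following is a sketch under two assumptions: the digest is the first whitespace-separated field on the first line of the *.md5 file, and either `md5sum` (Linux) or `md5` (macOS) is installed. The helper names `md5_of` and `verify_md5` are hypothetical.

```shell
# Print the MD5 digest of a file using whichever tool is available.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{print $1}'
    else
        md5 -q "$1"               # BSD/macOS equivalent
    fi
}

# Compare an archive ($1) against the digest stored in its .md5 file ($2).
# Assumption: the digest is the first field on the first line of that file.
verify_md5() {
    expected=$(awk 'NR == 1 {print $1}' "$2")
    actual=$(md5_of "$1")
    if [ "$expected" = "$actual" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}
```

A call would look like `verify_md5 GOTTCHA_lookup.tar.gz <md5 file>`, where the second argument is whatever name the downloaded *.md5 file has.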

You can use the 'wget' command to download both archives, one at a time:

    $ wget
    $ wget

Then use 'tar' to unpack and decompress both archives:

    $ tar -zxvf GOTTCHA_lookup.tar.gz
    $ tar -zxvf GOTTCHA_BACTERIA_c4937_k24_u30_xHUMAN3x.species.tar.gz

Files will be expanded to ./database directory by default.

Note: The plasmid-related results and the option need the newly parsed database to work properly. If you downloaded the databases before 30 March 2015, we encourage you to download the corresponding new parsed database (*.parsedGOTTCHA.dmp) to try the new plasmid-related feature. GOTTCHA v1.0 still supports old databases, but plasmid-related results will always be reported as zero because plasmid information is absent. The new .dmp files are about 130 MB each in gzipped format. Please decompress the file and use it to replace the old one. The general location of the file is "gottcha/parsed_database_only/<DATABASE_NAME>.parsedGOTTCHA.dmp.gz". For example, to download the new dmp file for the species-level database:
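
The "decompress and replace" step described in this note could be scripted roughly as follows. This is a sketch, not an official procedure: `replace_parsed_db` is a hypothetical helper, both paths are supplied by you, and keeping a `.bak` copy of the old file is a precaution added here.

```shell
# Decompress a newly downloaded *.parsedGOTTCHA.dmp.gz and put it in place of
# the old .dmp file, keeping the old file as <name>.bak.
replace_parsed_db() {
    new_gz=$1                     # downloaded .dmp.gz file
    old_dmp=$2                    # existing .dmp file to replace
    gzip -t "$new_gz" || return 1 # stop if the download is corrupt
    if [ -f "$old_dmp" ]; then
        mv "$old_dmp" "$old_dmp.bak"
    fi
    gzip -dc "$new_gz" > "$old_dmp"
}
```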

 $ wget

Here is a list of the available pre-computed databases. Note that these databases are also available in FASTA format at gottcha/[VERSION]/FASTA/..

Note: If you have any difficulty obtaining the databases, please contact Po-E Li


The procedure includes 3 major steps: (1) split-trimming the input data, (2) mapping reads to a GOTTCHA database using BWA, and (3) profiling/filtering the results. These steps have been wrapped into a single script called ''. Users need to provide a FASTQ file as input and specify the location and name of the database.

Here is the general usage to run GOTTCHA:


We provided a testing FASTQ file and example output in ./test. The following command is an example that runs "test.fastq" through GOTTCHA using a species-level database with 8 threads:

    $ bin/             \
         --threads 8             \
         --mode all              \
         --input test/test.fastq \
         --database database/GOTTCHA_BACTERIA_c4937_k24_u30_xHUMAN3x.species

In this case, we set the output mode to "all" with the "--mode all" option, which produces two output files and keeps all intermediate outputs in the "test_temp" directory. Both outputs are plain text files in tab-separated values format: a summary table "test.gottcha.tsv" and a full information table "test.gottcha_full.tsv".
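
After a run in "all" mode, a short check like the one below can confirm that both report files were written and are not empty. It is a sketch only: `check_outputs` is a hypothetical helper, and the file names are derived from the two names mentioned above ("<prefix>.gottcha.tsv" and "<prefix>.gottcha_full.tsv").

```shell
# Report which of the two expected report files exist and are non-empty.
# Usage: check_outputs <output directory> <input prefix, e.g. test>
check_outputs() {
    outdir=$1
    prefix=$2
    status=0
    for f in "$prefix.gottcha.tsv" "$prefix.gottcha_full.tsv"; do
        if [ -s "$outdir/$f" ]; then
            echo "found:   $outdir/$f"
        else
            echo "missing: $outdir/$f"
            status=1
        fi
    done
    return $status
}
```

For the example run, `check_outputs . test` returns a non-zero status if either file is absent.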

Interpreting Results

GOTTCHA reports profiling results in a neat summary table (*.gottcha.tsv) by default. The tsv file will list the organism(s) at all taxonomic levels from STRAIN to PHYLUM, their linear length, total bases mapped, linear depth of coverage, and the normalized linear depth of coverage. The linear depth of coverage (LINEAR_DOC) is used to calculate relative abundance of each organism or taxonomic name in the sample.

Summary table:

| Column | Description |
|--------|-------------|
| LEVEL | taxonomic rank |
| NAME | taxonomic name |
| REL_ABUNDANCE | relative abundance (equivalent to ROLLUP_DOC by default) |
| LINEAR_LENGTH | number of non-overlapping bases covering the signatures |
| TOTAL_BP_MAPPED | summation of all hit lengths recruited to signatures |
| HIT_COUNT | number of hits recruited to signatures |
| HIT_COUNT_PLASMID | number of hits recruited to plasmid signatures |
| READ_COUNT | number of reads recruited to signatures |
| LINEAR_DOC | linear depth-of-coverage (TOTAL_BP_MAPPED / LINEAR_LENGTH) |
| NORM_COV | normalized linear depth-of-coverage (LINEAR_DOC / Σ LINEAR_DOC_i, summed over all taxonomies i at the same level) |
| ROLLUP_DOC | summation of the average depth of coverage of the unique signature length of each strain belonging to this taxonomy (Σ TOTAL_BP_MAPPED_i / total_unique_signature_length_i, summed over all strains i) |
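
To make the LINEAR_DOC and NORM_COV definitions concrete, the sketch below recomputes both for one taxonomic level directly from the definitions in the table. It is an illustration, not GOTTCHA's own code. It assumes the summary file has a header row with the column names listed above and that the level label passed in (for example "species") matches the spelling used in the LEVEL column. `norm_cov` is a hypothetical helper.

```shell
# Recompute LINEAR_DOC and NORM_COV for one taxonomic level of a summary table.
# Usage: norm_cov <summary.tsv> <level>
# Output: NAME, LINEAR_DOC, NORM_COV (tab-separated, 4 decimal places).
norm_cov() {
    awk -F '\t' -v level="$2" '
        NR == 1 {
            # Locate the needed columns by header name.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $col["LEVEL"] == level && $col["LINEAR_LENGTH"] > 0 {
            n++
            name[n] = $col["NAME"]
            # LINEAR_DOC = TOTAL_BP_MAPPED / LINEAR_LENGTH
            doc[n] = $col["TOTAL_BP_MAPPED"] / $col["LINEAR_LENGTH"]
            total += doc[n]
        }
        END {
            # NORM_COV = LINEAR_DOC / sum of LINEAR_DOC over the same level.
            for (i = 1; i <= n && total > 0; i++)
                printf "%s\t%.4f\t%.4f\n", name[i], doc[i], doc[i] / total
        }
    ' "$1"
}
```

For example, `norm_cov test.gottcha.tsv species` prints one line per species-level row. As a worked case, two rows at the same level with 300 bp mapped over 100 bp and 200 bp mapped over 200 bp have LINEAR_DOC values of 3 and 1, and therefore NORM_COV values of 0.75 and 0.25.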

Besides the default summary table, two other report modes are available. The "full" report mode produces a table with more detailed information from the unfiltered results. The explanation of each column in the full report can be found in . The "all" report mode keeps all output files generated by each profiling step.

Visualizing Results using Krona

Krona is an interactive browser that allows the exploration of hierarchical data with pie charts. Assuming you have Krona installed properly, we are going to create a Krona chart from a text file listing abundances and lineages. A *.lineage.tsv file is produced when you run GOTTCHA in "all" output mode.

Use 'ktImportText' and save the chart to "test.krona.html":

    $ ktImportText test_temp/test.lineage.tsv -o test.krona.html


Tracey Allen K. Freitas, Po-E Li, Matthew B. Scholz and Patrick S. G. Chain (2015) Accurate read-based metagenome characterization using a hierarchical suite of unique signatures, Nucleic Acids Research (DOI: 10.1093/nar/gkv180)


Tracey Allen K. Freitas, Po-E Li, Matthew B. Scholz, Patrick S. G. Chain. Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545


We would like to thank Jason Gans for critical discussions on classification and machine learning techniques, and Shihai Feng for the generation of synthetic datasets.

This project is funded by the U.S. Defense Threat Reduction Agency [R-00059-12-0 and R-00332-13-0 to P.S.G.C.].