SJTU-CGM/CDKAM

CDKAM

Copyright 2019-2020
Authors: Van-Kien Bui, Chaochun Wei
Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University

1) Introduction

CDKAM (Classification tool using Discriminative K-mers and Approximate Matching algorithm) is a metagenome sequence classification tool for third-generation sequencing data with high error rates.

2) Requirements

We recommend installing BLAST+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/), which includes dustmasker. Low-complexity sequences, e.g. "ACACACACACACACACA" or "ATATATATATATATATAT", occur in many different organisms and are typically less informative in alignments. Regions masked by dustmasker are not processed further by CDKAM.
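
Before building a database, it can help to confirm that dustmasker is reachable on the PATH. The following is a minimal sketch, not part of CDKAM: the helper name require_tool is hypothetical, and it only checks that a command exists, not which BLAST+ version is installed.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named command can be found on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing tool: $1" >&2
    return 1
}

# dustmasker ships with BLAST+; print a hint if it cannot be found.
require_tool dustmasker || echo "Install BLAST+ to get dustmasker."
```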

3) Datasets

Datasets can be found at OneDrive:

4) Installation

5) Running CDKAM

Downloading the reference genomes (about 85 GB) may take 6-8 hours, and building the database may take 24-30 hours.
To build CDKAM with a different value of X, change the value of the variable RANGE in the /src/compress.cpp file as follows:
RANGE = 5 for X = 20%
RANGE = 7 for X = 15%
RANGE = 10 for X = 10%
RANGE = 20 for X = 5%
The default version of CDKAM selects X = 15% with RANGE = 7.
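
When scripting builds, the documented pairs can be kept in a small lookup. This is only a sketch: the function name range_for_x is hypothetical, it knows just the four pairs listed above, and RANGE in /src/compress.cpp still has to be edited by hand before building.

```shell
#!/bin/sh
# Hypothetical helper: print the documented RANGE value for a given X (in percent).
# Only the four pairs listed in this README are known; anything else is rejected.
range_for_x() {
    case "$1" in
        20) echo 5 ;;
        15) echo 7 ;;
        10) echo 10 ;;
        5)  echo 20 ;;
        *)  echo "unsupported X: $1" >&2; return 1 ;;
    esac
}

# The default configuration (X = 15%) prints 7.
range_for_x 15
```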

  • Downloading database:
    Standard installation with archaea, bacteria and viral reference genomes
    ./download --standard --db $DBname
    Or custom installation
    mkdir $DBname
    ./download_taxonomy.sh $DBname
    ./download --download-library archaea --db $DBname
    ./download --download-library bacteria --db $DBname
    ./download --download-library fungi --db $DBname
    ./download --download-library viral --db $DBname
    ./download --download-library human --db $DBname

  • Building database:
    ./build_database.sh $DBname

  • Running classification by default (using approximate matching strategies):
    ./CDKAM.sh $DBname input output --fasta/--fastq
    Use --fasta if the input is a FASTA file, or --fastq if the input is a FASTQ file.

  • Multi-threading:
    ./CDKAM.sh $DBname input output --fasta/--fastq nthread N
    where N is the number of threads.

  • CDKAM also supports classification in Exact Matching mode:
    ./CDKAM_EM.sh $DBname input output --fasta/--fastq

  • Running translation:
    ./translate $DBname input output
    where input is the result of the previous classification process.
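
The steps above can be chained in one script. The sketch below assumes the standard installation and single-threaded classification. The names DRY_RUN, run and cdkam_pipeline, and the ".translated" output suffix, are hypothetical and not part of CDKAM. By default it only prints the commands, so it can be inspected before committing to the long download and build.

```shell
#!/bin/sh
# Sketch of the full workflow, chaining the commands documented in this section.
# DRY_RUN, run and cdkam_pipeline are hypothetical names, not part of CDKAM.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Arguments: database name, input reads, output file, --fasta or --fastq
cdkam_pipeline() {
    db="$1"; input="$2"; output="$3"; fmt="$4"
    run ./download --standard --db "$db" || return 1
    run ./build_database.sh "$db" || return 1
    run ./CDKAM.sh "$db" "$input" "$output" "$fmt" || return 1
    # The ".translated" suffix is an arbitrary choice for this sketch.
    run ./translate "$db" "$output" "$output.translated" || return 1
}

cdkam_pipeline mydb reads.fastq result.txt --fastq
```

Setting DRY_RUN=0 in the environment runs the commands for real. Each step stops the pipeline if it fails.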

6) Output format

Normal mode
(Read ID) (Length of read) (Taxonomy ID)

Example:

  • 1 985 -1
  • 2 733 28116
  • 3 886 590
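
A normal-mode result can be summarised with a short awk filter. This sketch assumes whitespace-separated columns without the bullet markers shown above, and it assumes that a taxonomy ID of -1 means the read was left unclassified (inferred from the example, not stated explicitly). The function name summarize_normal is hypothetical.

```shell
#!/bin/sh
# Summarise normal-mode output: columns are read ID, read length, taxonomy ID.
# Assumption: a taxonomy ID of -1 means the read was left unclassified.
# Reads the files given as arguments, or standard input if none are given.
summarize_normal() {
    awk '
        NF >= 3 {
            total++
            if ($3 == -1) unclassified++
        }
        END {
            printf "total=%d classified=%d unclassified=%d\n",
                   total, total - unclassified, unclassified
        }
    ' "$@"
}

# Sample data copied from the example above.
printf '1 985 -1\n2 733 28116\n3 886 590\n' | summarize_normal
# prints: total=3 classified=2 unclassified=1
```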

Translation mode
(Read ID) (Genus taxonomy ID) (Genus taxonomy ID) (AG) Full taxonomic path to Genus level
or
(Read ID) (Genus taxonomy ID) (Species taxonomy ID) (AS) Full taxonomic path to Species level

Example:

  • 1 -1 -1
  • 2 816 28116 (AS) (P) Bacteroidetes | (C) Bacteroidia | (O) Bacteroidales | (F) Bacteroidaceae | (G) Bacteroides | (S) Bacteroides ovatus
  • 3 590 590 (AG) (P) Proteobacteria | (C) Gammaproteobacteria | (O) Enterobacterales | (F) Enterobacteriaceae | (G) Salmonella

(AS) indicates that the read is assigned to a taxonomy ID at Species level
(AG) indicates that the read is assigned to a taxonomy ID at Genus level
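
Translation-mode output can be reduced to one genus per read with awk. This is a sketch under stated assumptions: fields are whitespace-separated without the bullet markers, the fourth field is (AS) or (AG), ranks in the path are separated by " | ", and the genus entry is prefixed with "(G) ". Reads without an assignment (such as "1 -1 -1") are skipped. The function name genus_per_read is hypothetical.

```shell
#!/bin/sh
# Print read ID, assignment level and genus name from translation-mode output.
# Assumptions: the 4th field is (AS) or (AG), ranks are separated by " | ",
# and the genus entry is prefixed with "(G) ".
genus_per_read() {
    awk '
        $4 == "(AS)" || $4 == "(AG)" {
            level = ($4 == "(AS)") ? "species" : "genus"
            genus = "unknown"
            n = split($0, parts, / [|] /)
            for (i = 1; i <= n; i++)
                if (match(parts[i], /\(G\) /))
                    genus = substr(parts[i], RSTART + RLENGTH)
            print $1 "\t" level "\t" genus
        }
    ' "$@"
}

# Sample data copied from the example above.
genus_per_read <<'EOF'
1 -1 -1
2 816 28116 (AS) (P) Bacteroidetes | (C) Bacteroidia | (O) Bacteroidales | (F) Bacteroidaceae | (G) Bacteroides | (S) Bacteroides ovatus
3 590 590 (AG) (P) Proteobacteria | (C) Gammaproteobacteria | (O) Enterobacterales | (F) Enterobacteriaceae | (G) Salmonella
EOF
```

For the sample data this prints two tab-separated lines, one for read 2 (species, Bacteroides) and one for read 3 (genus, Salmonella).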

7) Testing tool

This tool compares the classification results of CDKAM and Kraken2. The test database contains archaea, viruses and fungi.

https://github.com/buikiendp/TestCDKAM

Contact:

If you have any questions, feel free to contact us:
buikien.dp@sjtu.edu.cn
ccwei@sjtu.edu.cn