CDKAM

Copyright 2019-2020
Authors: Van-Kien Bui, Chaochun Wei
Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University

1) Introduction

CDKAM (Classification tool using Discriminative K-mers and Approximate Matching algorithm) is a metagenome sequence classification tool designed for third-generation sequencing data with high error rates.

2) Requirements

Linux operating system
Memory: 70 GB
Disk space: 200 GB
Perl 5.8.5 or later and GCC 4.8.5 or later.
Dustmasker https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/dustmasker/.

We suggest installing BLAST+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/), which already includes dustmasker. Low-complexity sequences, e.g. "ACACACACACACACACA" and "ATATATATATATATATAT", occur in many different organisms and are typically less informative for alignment. Dustmasker masks these regions, and CDKAM does not process the masked regions further.

3) Datasets

The datasets are available at the following locations (the simulated datasets are hosted on OneDrive):

The first simulated dataset https://1drv.ms/u/s!AvI5WFKEnJrGeQlkB-KTexns4m8?e=lxVWKy
The second simulated dataset https://1drv.ms/u/s!AvI5WFKEnJrGeu0OzwT1556LlG0?e=ThGos7
The third simulated dataset https://1drv.ms/u/s!AvI5WFKEnJrGe6q7R76aKHVx29k?e=bg2ogf
The fourth simulated dataset https://1drv.ms/u/s!AvI5WFKEnJrGfDYHCWoOfBN06zs?e=g3m7Zz
A sample of Nanopore MinION sequencing data https://www.st-va.ncbi.nlm.nih.gov/bioproject/PRJNA493153
Zymo mock dataset: https://github.com/LomanLab/mockcommunity

4) Installation

First, download the latest CDKAM release from https://github.com/SJTU-CGM/CDKAM and extract it.
Then change into the extracted sub-directory "CDKAM" and run the installation script:
$ ./install.sh

5) Running CDKAM

Downloading the reference genomes (about 85 GB) may take 6-8 hours, and building the database may take 24-30 hours.
To build CDKAM with a different value of X, change the value of the variable RANGE in the /src/compress.cpp file as follows:
RANGE = 5 for X = 20%
RANGE = 7 for X = 15%
RANGE = 10 for X = 10%
RANGE = 20 for X = 5%
The default version of CDKAM selects X = 15% with RANGE = 7.
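
The four documented pairs above are consistent with RANGE being roughly 100 / X rounded to the nearest integer. That relationship is an inference from the table, not something stated by CDKAM itself, so the sketch below keeps the documented pairs as the source of truth and only checks the pattern against them. The names `DOCUMENTED_RANGE`, `range_for_x` and `matches_inverse_pattern` are hypothetical helpers for illustration.

```python
# Documented RANGE values from the table above, keyed by X in percent.
DOCUMENTED_RANGE = {20: 5, 15: 7, 10: 10, 5: 20}


def range_for_x(x_percent: int) -> int:
    """Return the documented RANGE value for a given X (in percent).

    Only the four documented values of X are accepted; any other value
    raises ValueError instead of guessing an undocumented setting.
    """
    try:
        return DOCUMENTED_RANGE[x_percent]
    except KeyError:
        raise ValueError(
            f"X = {x_percent}% is not documented; choose one of "
            f"{sorted(DOCUMENTED_RANGE)}"
        ) from None


def matches_inverse_pattern() -> bool:
    """Check the ASSUMED pattern RANGE == round(100 / X) for every documented pair."""
    return all(round(100 / x) == r for x, r in DOCUMENTED_RANGE.items())


if __name__ == "__main__":
    for x in sorted(DOCUMENTED_RANGE, reverse=True):
        print(f"X = {x}% -> RANGE = {range_for_x(x)}")
    print("inverse pattern holds for documented pairs:", matches_inverse_pattern())
```

The lookup deliberately refuses undocumented values of X, because the source gives no guidance on whether intermediate RANGE values are supported.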

Downloading database:
Standard installation with archaeal, bacterial and viral reference genomes
./download --standard --db $DBname
Or custom installation
mkdir $DBname
./download_taxonomy.sh $DBname
./download --download-library archaea --db $DBname
./download --download-library bacteria --db $DBname
./download --download-library fungi --db $DBname
./download --download-library viral --db $DBname
./download --download-library human --db $DBname
Building database:
./build_database.sh $DBname
Running classification in the default mode (approximate matching):
./CDKAM.sh $DBname input output --fasta/--fastq
Use --fasta if the input is a FASTA file, or --fastq if the input is a FASTQ file.
Multi-threading:
./CDKAM.sh $DBname input output --fasta/--fastq nthread N
where N is the number of threads.
CDKAM also supports classification in Exact Matching mode:
./CDKAM_EM.sh $DBname input output --fasta/--fastq
Running translation:
./translate $DBname input output
where input is the result of the previous classification process.
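
When classification is driven from a script, the command lines above can be assembled programmatically. The following is a minimal sketch, not part of CDKAM: `build_cdkam_command` is a hypothetical helper that uses only the scripts and arguments shown above. Because the thread option is documented here only for CDKAM.sh, the helper refuses to combine it with Exact Matching mode rather than assume it works there.

```python
import shlex
from typing import List, Optional


def build_cdkam_command(
    db_name: str,
    input_path: str,
    output_path: str,
    input_format: str,
    threads: Optional[int] = None,
    exact_matching: bool = False,
) -> List[str]:
    """Build the argument list for a CDKAM classification run.

    input_format must be "fasta" or "fastq" and becomes --fasta / --fastq.
    """
    if input_format not in ("fasta", "fastq"):
        raise ValueError("input_format must be 'fasta' or 'fastq'")

    script = "./CDKAM_EM.sh" if exact_matching else "./CDKAM.sh"
    command = [script, db_name, input_path, output_path, f"--{input_format}"]

    if threads is not None:
        if exact_matching:
            # Assumption: the README documents "nthread N" only for CDKAM.sh.
            raise ValueError("nthread is only documented for CDKAM.sh")
        if threads < 1:
            raise ValueError("threads must be a positive integer")
        command += ["nthread", str(threads)]

    return command


if __name__ == "__main__":
    cmd = build_cdkam_command("DBname", "reads.fastq", "result.txt", "fastq", threads=8)
    # Print a shell-quoted form; pass `cmd` to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

The paths `reads.fastq` and `result.txt` are placeholders. Returning a list instead of a single string avoids shell-quoting problems when file names contain spaces.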

6) Output format

Normal mode
(Read ID) (Length of read) (Taxonomy ID)

Example:

1 985 -1
2 733 28116
3 886 590
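
The normal-mode output can be loaded with a few lines of code. The sketch below assumes that the three columns are separated by whitespace and that a taxonomy ID of -1 marks a read that was not assigned to any taxon; both are inferred from the example above rather than stated explicitly. `Classification`, `parse_normal_output` and `classified_fraction` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class Classification(NamedTuple):
    read_id: str
    read_length: int
    tax_id: int  # -1 is ASSUMED to mean "unclassified"


def parse_normal_output(lines: Iterable[str]) -> List[Classification]:
    """Parse normal-mode lines of the form: (Read ID) (Length of read) (Taxonomy ID)."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_number}: expected 3 columns, got {len(fields)}")
        read_id, length, tax_id = fields
        records.append(Classification(read_id, int(length), int(tax_id)))
    return records


def classified_fraction(records: List[Classification]) -> float:
    """Return the fraction of reads whose taxonomy ID is not -1."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.tax_id != -1) / len(records)


if __name__ == "__main__":
    example = ["1 985 -1", "2 733 28116", "3 886 590"]
    records = parse_normal_output(example)
    for record in records:
        print(record)
    print(f"classified fraction: {classified_fraction(records):.2f}")
```

To read a real result file, pass the open file object directly, for example `parse_normal_output(open("output"))`, where `output` is the path given to CDKAM.sh.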

Translation mode
(Read ID) (Genus taxonomy ID) (Genus taxonomy ID) (AG) Full taxonomic path to Genus level
or
(Read ID) (Genus taxonomy ID) (Species taxonomy ID) (AS) Full taxonomic path to Species level

Example:

1 -1 -1
2 816 28116 (AS) (P) Bacteroidetes | (C) Bacteroidia | (O) Bacteroidales | (F) Bacteroidaceae | (G) Bacteroides | (S) Bacteroides ovatus
3 590 590 (AG) (P) Proteobacteria | (C) Gammaproteobacteria | (O) Enterobacterales | (F) Enterobacteriaceae | (G) Salmonella

(AS) indicates that the read is assigned to a taxonomy ID at Species level
(AG) indicates that the read is assigned to a taxonomy ID at Genus level
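
Translation-mode lines can be split into their parts in a similar way. The sketch below is one possible parser under stated assumptions: fields are separated by whitespace, the lineage entries are separated by "|", each entry has the form "(rank) name", and unassigned reads carry only the three leading columns, as in the example above. `TranslatedRead` and `parse_translated_line` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# One lineage entry such as "(G) Salmonella": rank letter in parentheses, then the name.
_RANK_ENTRY = re.compile(r"\((\w+)\)\s*(.+)")


class TranslatedRead(NamedTuple):
    read_id: str
    genus_tax_id: int
    assigned_tax_id: int  # species ID for (AS), genus ID again for (AG)
    level: Optional[str]  # "AS", "AG", or None when no tag is present
    lineage: List[Tuple[str, str]]  # (rank, name) pairs in the order given


def parse_translated_line(line: str) -> TranslatedRead:
    """Parse one translation-mode line into its columns and lineage."""
    fields = line.split(None, 3)
    if len(fields) < 3:
        raise ValueError("expected at least 3 columns")
    read_id, genus, assigned = fields[:3]

    level = None
    lineage: List[Tuple[str, str]] = []
    if len(fields) == 4:
        rest = fields[3].split(None, 1)
        level = rest[0].strip("()")
        path = rest[1] if len(rest) > 1 else ""
        for entry in path.split("|"):
            match = _RANK_ENTRY.fullmatch(entry.strip())
            if match:
                lineage.append((match.group(1), match.group(2)))

    return TranslatedRead(read_id, int(genus), int(assigned), level, lineage)


if __name__ == "__main__":
    line = (
        "3 590 590 (AG) (P) Proteobacteria | (C) Gammaproteobacteria | "
        "(O) Enterobacterales | (F) Enterobacteriaceae | (G) Salmonella"
    )
    read = parse_translated_line(line)
    print(read.level, read.lineage[-1])
```

Keeping the lineage as ordered (rank, name) pairs makes it straightforward to group reads by any rank, for example by the name attached to "G" for genus-level summaries.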

7) Testing tool

The repository below compares the classification results of CDKAM and Kraken2. The test database contains archaeal, viral and fungal reference genomes.

https://github.com/buikiendp/TestCDKAM

Contact:

If you have any questions, feel free to contact us:
buikien.dp@sjtu.edu.cn
ccwei@sjtu.edu.cn
