In this page, we note down the benchmarking methods and results of some DNA read mapping tools. The tools we use for benchmarking include Bowtie2, BWA-MEM, subread.
We write down how we download the read mapping tools for future reference. We also add script of adding the binary applications to PATH
for easier usage. All tools can be built from source
-
Bowtie2 can be downloaded directly using bioconda.
conda install -c bioconda bowtie2
or from source
git clone https://github.com/BenLangmead/bowtie2.git cd bowtie2; make echo "export PATH=\${PATH}:$(pwd)" >> ~/.bashrc
-
BWA-MEM.
git clone https://github.com/lh3/bwa.git cd bwa; make echo "export PATH=\${PATH}:$(pwd)" >> ~/.bashrc
-
Subread.
git clone https://github.com/ShiLab-Bioinformatics/subread.git cd subread/src make -f Makefile.Linux cd .. echo "export PATH=\${PATH}:$(pwd)/bin" >> ~/.bashrc
-
Minimap2.
git clone https://github.com/lh3/minimap2 cd minimap2 && make echo "export PATH=\${PATH}:$(pwd)" >> ~/.bashrc
After adding everything to .bashrc
, we can run
source ~/.bashrc
to use the binaries by their names directly.
-
GraphMap.
git clone https://github.com/isovic/graphmap.git cd graphmap make modules make echo "export PATH=\${PATH}:$(pwd)/bin/Linux-x64" >> ~/.bashrc
For short reads,
DATASET_NAME=ecoli_simulated
python ~/neat-genreads/gen_reads.py -r GCF_001890245.1_ASM189024v1_genomic.fna -R150 -o ${DATASET_NAME} --bam
# Convert BAM file to SAM file
samtools index ${DATASET_NAME}_golden.bam
samtools view -h -o ${DATASET_NAME}_golden.sam ${DATASET_NAME}_golden.bam
# Decompress fq.gz file
zcat ${DATASET_NAME}_read1.fq.gz >> ${DATASET_NAME}_read1.fq
# Remove spaces in the simulated read names
~/bucket-map/bucket_map/benchmark/rename_reads.sh ${DATASET_NAME}_read1.fq >> ${DATASET_NAME}_read1_renamed.fq
For long reads, we use pbsim.
mkdir lr_simulated; cd lr_simulated
pbsim --strategy wgs --method qshmm --qshmm ~/pbsim3/data/QSHMM-ONT-HQ.model --depth 20 --genome ~/data/mapping/GCA_022991725.1_PDT001286877.1_genomic.fna
To benchmark the time and memory usage, we mainly use Linux's time
command. An example usage is
/usr/bin/time -v bowtie2 -x "${INDEX_INDICATOR}_bowtie" -U ${FASTQ_PATH} -S "${BENCHMARK_PATH}/output/bowtie2_map.sam" &> "${BENCHMARK_PATH}/log/bowtie2_map.log"
Note that we write /usr/bin/time
instead of simply time
to specify the binary we are using. An output is shown below.
Command being timed: "bowtie2 -x egu_bowtie -U /mnt/d/genome/test/sim_illumina_1M.fastq -S /home/zhenhao/bucket-map/bucket_map/benchmark/output/bowtie2_map.sam"
User time (seconds): 852.35
System time (seconds): 1.33
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 14:21.50
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1968900
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 102
Minor (reclaiming a frame) page faults: 8740
Voluntary context switches: 97183
Involuntary context switches: 636
Swaps: 0
File system inputs: 3866056
File system outputs: 1382648
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
We mainly look at the following fields:
User time (seconds)
: the total time used by the program.Maximum resident set size (kbytes)
: peak memory usage of the program.Major (requiring I/O) page faults
,Minor (reclaiming a frame) page faults
: number of page faults caused by the program.
We write down a script that gathers the output of the mappers and the returned log of time
command in benchmark_index.sh
and benchmark_map.sh
. The generated logs and outputs of programs are stored under ./log
and the output sam files under ./output
.
To benchmark accuracy: whether the mapper returns the true location of the read, we use the following metrics.
- Percentage of reads mapped,
- Average number of locations a read is mapped to,
- Percantage of reads mapped to the correct reference (chromosome or contig),
- Percentage of reads mapped to the correct position.
The metrics can be checked directly using the sam file. The benchmarking script is written in sam_file_analyzer.cpp
, which checks all sam files under ./output
.
We primarily use the reference genome of Elaeis guineensis (to be checked) with length 1,701,312,507. The dataset consists of 16 chromosomes and 916 contigs. For all tools, we simply use the default settings.
We record the time and memory performance when producing the index files. Usually this is less important than performance of mapping, as it only requires a one-time run.
Tool | User time (s) | Peak memory usage (Kb) | Major page faults | Minor page faults |
---|---|---|---|---|
Bowtie2 | 2424.85 | 7648524 | 832483 | 7510117 |
BWA-MEM | 1529.78 | 2495668 | 13 | 424721 |
Subread | 633.29 | 3197744 | 8 | 1347347 |
Note: The indexing of Bowtie2 is very memory consuming. We need to make sure that we have enough memory before running the indexing (reboot if possible before running), otherwise it will run into exit status 247
.
The time and memory usage of the tools with 1 million reads of length 300 (with sequencing error rates similar to Illumina) are shown below.
Tool | User time (s) | Peak memory usage (Kb) | Major page faults | Minor page faults |
---|---|---|---|---|
Bowtie2 | 852.35 | 1968900 | 102 | 8740 |
BWA-MEM | 563.19 | 2995356 | 25 | 436960 |
Subread | 573.93 | 4201588 | 13 | 734950 |
BucketMap | 426.78 | 865568 | 0 | 138474 |
Alignment-free BucketMap | 320.95 | 865344 | 0 | 135648 |
And the accuracy of the tools when comparing with the ground truth is
Tool | % reads mapped | Average mapped locations | % correct reference | % correct positions |
---|---|---|---|---|
Bowtie2 | 99.8904% | 1 | 98.4358% | 97.6452% |
BWA-MEM | 100% | 1.00142 | 98.9025% | 98.0892% |
Subread | 96.7519% | 1 | 95.735% | 94.8024% |
BucketMap | 98.6348% | 1.14538 | 97.8566% | 97.3444% |
Alignment-free BucketMap | 98.6348% | 1.14538 | 97.8566% | 97.3444% |