Skip to content

Latest commit

 

History

History

test

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Reproducing the results in the paper

Prerequisites

We compare BLEND with Minimap2 (for finding overlapping reads and read mapping), MHAP (for finding overlapping reads), LRA (for read mapping), Winnowmap (for read mapping), and S-conLSH (for read mapping). We specify the versions we use for each tool in our submission in Supplementary Table S3.

We list the links to download and compile each tool for comparison below:

We use various tools to process and analyze the data we generate using each tool. The following tools must also be installed in your machine:

We suggest using conda to install these tools with their specified versions as almost all of them are included in the conda repository.

Please make sure that all of these tools are in your PATH

Datasets

All the datasets (except the real human genome datasets) can be downloaded via Zenodo. For the human genome datasets, we use its already available repositories to download. We provide the scripts to download all these files under data directory. In order to download:

cd data

bash download-e.coli-pb-sequelii.sh #E. coli PacBio HiFi Dataset (100X)
bash download-e.coli-pb-rs.sh #E. coli PacBio CLR Dataset (112X)
bash download-yeast-pb-pbsim2.sh #Simulated Yeast PacBio CLR Simulated Dataset (200X)
bash download-yeast-ont-pbsim2.sh #Simulated Yeast ONT Simulated Dataset (100X)
bash download-yeast-illumina.sh #Yeast Illumina Dataset (80X)
bash download-d.ananassae-pb-sequelii.sh #D. ananassae PacBio HiFi Dataset (50X)
bash download-chm13-pb-sequelii-16X.sh #Human CHM13 PacBio HiFi Dataset (16X)
bash download-chm13-ont-pbsim2.sh #Simulated Human CHM13 ONT R9.5 Dataset (30X)
bash download-hg002-pb-sequelii-52X.sh #Human HG002 Dataset (52X)

cd .. #going back to the test directory

Now that you have downloaded all the datasets, we can start running all the tools to collect the results.

Finding Overlapping Reads

We run each tool for finding overlapping reads for each dataset. We use 32 threads for each tool but you can easily change this number in the scripts we use below. We also provide the scripts to run the tools with SLURM. Please modify the SLURM_OPTIONS variable inside these scripts according to your system configuration. If you want to run the tools via SLURM, please use the corresponding scripts with the _slurm.sh prefix instead of the scripts we mention below.

BLEND

Here we run BLEND to find overlapping reads for each dataset.

cd blend
bash overlap.sh

#The following applies to all the scripts we mention below for running the tools:
#If you want to run this script via SLURM, please edit SLURM_OPTIONS in overlap_slurm.sh and run the following command.
#bash overlap_slurm.sh
cd ..

Minimap2

Here we run Minimap2 to find overlapping reads for each dataset.

cd minimap2
bash overlap.sh
cd ..

MHAP

Here we run MHAP to find overlapping reads for each dataset.

cd mhap
bash overlap.sh
cd ..

Read Mapping

We run each tool to perform read mapping for each dataset. We use 32 threads for each tool and 8 threads for sorting the BAM files but you can easily change this number in the scripts we use below.

BLEND

Here we run BLEND to perform read mapping for each dataset.

cd blend
bash read_mapping.sh

#The following applies to all the scripts we mention below for running the tools:
#If you want to run this script via SLURM, please edit SLURM_OPTIONS in read_mapping_slurm.sh and run the following command.
#bash read_mapping_slurm.sh
cd ..

Minimap2

Here we run Minimap2 to perform read mapping for each dataset.

cd minimap2
bash read_mapping.sh
cd ..

Winnowmap

Here we run Winnowmap to perform read mapping for each dataset.

cd winnowmap
bash read_mapping.sh
cd ..

LRA

Here we run LRA to perform read mapping for each dataset.

cd lra
bash read_mapping.sh
cd ..

S-conLSH

Here we run S-conLSH to perform read mapping for each dataset.

cd conlsh
bash read_mapping.sh
cd ..

Strobealign

Here we run Strobealign to perform read mapping for each dataset.

cd strobealign
bash read_mapping.sh
cd ..

Comparing the Results

We have already created symbolic links for all the files that each tool would generate under the comparisons directory. The subdirectories under comparisons are created for each dataset.

We provide the scripts we use for evaluating the results under the scripts directory. Each script is also already linked in the relevant directories so that one could easily run the correct scripts to evaluate the results easily.

We also provide the python code, raw files, and scripts we use for generating the figures we show in the paper. You can change the raw files (i.e., csv files) based on the results newly generated (e.g., performance results). These files are all included in the figures directory. We explain the subdirectories in more detail below.

Below we explain how to evaluate the results we generate for 1) finding overlapping reads and 2) read mapping.

Finding Overlapping Reads

Each of the subdirectories under comparisons include a directory called overlap where we include all the files to compare for a certain dataset. Each overlap directory includes the 1_generate_results.sh and summarize.sh scripts. Running 1_generate_results.sh generates and outputs all the corresponding evaluation results. If you want to see the results without re-running 1_generate_results.sh, you can run summarize.sh to output the results. We explain how to evaluate the finding overlapping results for each dataset below.

  • E. coli PacBio HiFi Dataset (100X)
cd comparisons/e.coli-pb-sequelii/overlap

#Runs dnadiff and quast to generate all the results regarding the assembly and overlap statistics. It outputs the result at the end to the standard output.
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • E. coli PacBio CLR Dataset (112X)
cd comparisons/e.coli-pb-rs/overlap

#Runs dnadiff and quast to generate all the results regarding the assembly and overlap statistics. It outputs the result at the end to the standard output.
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Simulated Yeast PacBio CLR Simulated Dataset (200X)
cd comparisons/yeast-pb-pbsim-200x/overlap

#Runs dnadiff and quast to generate all the results regarding the assembly and overlap statistics. It outputs the result at the end to the standard output.
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Simulated Yeast ONT Simulated Dataset (100X)
cd comparisons/yeast-ont-pbsim-100x/overlap

#Runs dnadiff and quast to generate all the results regarding the assembly and overlap statistics. It outputs the result at the end to the standard output.
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • D. ananassae PacBio HiFi Dataset (50X)
cd comparisons/d.ananassae-pb-sequelii/overlap

#Runs dnadiff and quast to generate all the results regarding the assembly and overlap statistics. It outputs the result at the end to the standard output.
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Human CHM13 PacBio HiFi Dataset (16X)
cd comparisons/chm13-pb-sequelii-16X/overlap

#Runs dnadiff and quast to generate all the results regarding the assembly and overlap statistics. It outputs the result at the end to the standard output.
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Simulated Human CHM13 ONT R9.5 Dataset (30X)
cd comparisons/chm13-ont-pbsim2/overlap

#Runs dnadiff and quast to generate all the results regarding the assembly and overlap statistics. It outputs the result at the end to the standard output.
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../

Read Mapping

Each of the subdirectories under comparisons include a directory called read_mapping where we include all the files to compare for a certain dataset. Each read_mapping directory includes the 1_generate_results.sh and summarize.sh scripts. Running 1_generate_results.sh generates and outputs all the corresponding evaluation results. If you want to see the results without re-running 1_generate_results.sh, you can run summarize.sh to output the results. We explain how to evaluate the finding overlapping results for each dataset below. We explain how to evaluate the read mapping results for each dataset below.

  • E. coli PacBio HiFi Dataset (100X)
cd comparisons/e.coli-pb-sequelii/read_mapping

#Runs BamUtil, bedtools, samtools, deepTools, and mosdepth to generate the statistics based on the bam files
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • E. coli PacBio CLR Dataset (112X)
cd comparisons/e.coli-pb-rs/read_mapping

#Runs BamUtil, bedtools, samtools, deepTools, and mosdepth to generate the statistics based on the bam files
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Simulated Yeast PacBio CLR Simulated Dataset (200X)
cd comparisons/yeast-pb-pbsim-200x/read_mapping

#This script has multiple phases:
#1) Runs BamUtil, bedtools, samtools, deepTools, and mosdepth to generate the statistics based on the bam files
#2) Generates the data used for evaluating the read mapping accuracy. Precision results are calculated based on the
#number of reads that map to the chromosome in the 1) resulting BAM/SAM files and 2) in the ground truth

bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Simulated Yeast ONT Simulated Dataset (100X)
cd comparisons/yeast-ont-pbsim-100x/read_mapping

#This script has multiple phases:
#1) Runs BamUtil, bedtools, samtools, deepTools, and mosdepth to generate the statistics based on the bam files
#2) Generates the data used for evaluating the read mapping accuracy. Precision results are calculated based on the
#number of reads that map to the chromosome in the 1) resulting BAM/SAM files and 2) in the ground truth

bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Yeast Illumina Dataset (80X)
cd comparisons/yeast-illumina/read_mapping

#Runs BamUtil, bedtools, samtools, deepTools, and mosdepth to generate the statistics based on the bam files
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • D. ananassae PacBio HiFi Dataset (50X)
cd comparisons/d.ananassae-pb-sequelii/read_mapping

#Runs BamUtil, bedtools, samtools, deepTools, and mosdepth to generate the statistics based on the bam files
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Human CHM13 PacBio HiFi Dataset (16X)
cd comparisons/chm13-pb-sequelii-16X/read_mapping

#Runs BamUtil, bedtools, samtools, deepTools, and mosdepth to generate the statistics based on the bam files
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Simulated Human CHM13 ONT R9.5 Dataset (30X)
cd comparisons/chm13-ont-pbsim2/read_mapping

#This script has multiple phases:
#1) Runs BamUtil, bedtools, samtools, deepTools, and mosdepth to generate the statistics based on the bam files
#2) Generates the data used for evaluating the read mapping accuracy. Precision results are calculated based on the
#number of reads that map to the chromosome in the 1) resulting BAM/SAM files and 2) in the ground truth

bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../
  • Human HG002 Dataset (52X)
cd comparisons/hg002-pb-ccs-52X/read_mapping

#Runs BamUtil, bedtools, samtools, deepTools, and mosdepth to generate the statistics based on the bam files
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh

cd ../../../

Structural Variant Calling

We use sniffles to call structural variants (SVs) from the BAM files that each tool generates from the read mapping step. We use the Human HG002 dataset and compare the SVs to the HG002 Tier 1 SV Truth Set released by GIAB. We explain how to evaluate the SV results.

  • Human HG002 Dataset (52X)
cd comparisons/hg002-pb-ccs-52X/sv_calling

#Runs truvari to compare the truth SV sets to the VCFs generated from the BAM files of each tool
bash 1_generate_results.sh

#If you want to output the results after successful completion of 1_generate_results.sh, you can run the following command
bash summarize.sh


cd ../../../

Figures

The figures directory includes two subdirectories for the applications that BLEND is evaluated: 1) finding overlapping reads and 2) read mapping. The directory also includes fuzzy_seed_matching that we use when evaluating the fuzzy seed matching statistics of BLEND. All last-level subdirectories include a script called run.sh to generate the figure in PDF format given that all the results are generated previously and the corresponding CSV files are filled accordingly. The CSV files we provide include the results we generate. These CSV files should be correctly updated if you re-run BLEND and other tools using the commands we provide above.

Python 3 and the numpy, pandas, matplotlib, gnuplot, and seaborn packages are suggested to generate these figures.

Generating the Fuzzy Seed Matching Statistics Figure (Figure 7)

#Overlap statistics (Figure 7)
cd figures/fuzzy_seed_matching/
bash run.sh

cd ../../../

Generating the Performance and Peak Memory Usage Figures

#Finding overlapping reads (Figure 8)
cd figures/overlap/perf/
bash run.sh

cd ../
#Finding overlapping reads between BLEND-I and BLEND-S (Supp. Figure S3)
cd ./perf-blend/
bash run.sh

cd ../
#Finding overlapping reads between BLEND-I, minimap2, and minimap2-Eq (Supp. Figure S5)
cd ./perf-eq/
bash run.sh

cd ../../../

#Read mapping (Figure 10)
cd figures/read_mapping/perf/
bash run.sh

cd ../
#Read mapping between BLEND-I and BLEND-S (Supp. Figure S4)
cd figures/read_mapping/perf-blend/
bash run.sh

cd ../../../

Generating the Overlap Statistics Figure (Figure 9)

#Overlap statistics (Figure 9)
cd figures/overlap/overlap_stats/
bash run.sh

cd ../../../

Generating the Read Mapping Accuracy Figure (Figure 11)

#Read mapping accuracy (Figure 11)
cd figures/read_mapping/accuracy/
bash run.sh

cd ../../../

Generating the Genome-wide Coverage Comparison Figures (Supp. Figures S6, S7, and S8)

#Genome-wide Coverage Comparison Figures (Supp. Figures S6, S7, and S8)
cd figures/read_mapping/coverage/
bash run.sh

cd ../../../

Source of Figures

We provide the original PDFs we use in our paper. The source and the PDFs can be extracted from the PPTX files we provide in the fuzzy_seed_matching, overlap and read mapping directories. If new PDFs are generated using the commands above, the new PDFs can be replaced by the PDFs we have in the PPTX files to fully generate the original figures we have in our manuscript.