Population-scale detection of non-reference sequence variants using colored de Bruijn Graphs
Requirement | Tested with |
---|---|
64-bit POSIX-compliant operating system | Ubuntu 16.04 / 18.04, CentOS Linux 7.6 |
C++14 capable compiler | g++ vers. 4.9.2, 5.5.0, 7.2.0 |
Bifrost | vers. 1.0.4-ab43065 |
bwa | vers. 0.7.15-r1140 |
samtools | vers. 1.3, 1.5 |
sickle | vers. 1.33 |
gatb-minia-pipeline | (submodule; no need to install) |
SeqAn | (header library; no need to install) |
PopIns2 requires an installation of the Bifrost C++ library.
Bifrost must be compiled from source with the MAX_KMER_SIZE=64 setting.
Do not use the conda package installation of Bifrost as it does not meet the requirement for larger k-mer sizes and performance optimizations.
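A from-source build along the following lines should satisfy this requirement. It is a sketch, assuming the upstream Bifrost repository at https://github.com/pmelsted/bifrost and its CMake option `-DMAX_KMER_SIZE`; the function name `build_bifrost` and the default install prefix are illustrative and not part of PopIns2 or Bifrost. The Bifrost version listed in the table above may need to be checked out explicitly.

```shell
# Sketch: build Bifrost from source with support for larger k-mer sizes.
# Assumptions: upstream repository URL, CMake-based build, and an install
# prefix chosen by the caller (defaults to $HOME/.local).
# The body runs in a subshell so the caller's working directory is unchanged.
build_bifrost() (
    prefix="${1:-$HOME/.local}"
    git clone https://github.com/pmelsted/bifrost.git &&
    cd bifrost &&
    mkdir -p build &&
    cd build &&
    cmake .. -DMAX_KMER_SIZE=64 -DCMAKE_INSTALL_PREFIX="$prefix" &&
    make &&
    make install
)

# Nothing is downloaded until the function is called, e.g.:
#   build_bifrost "$HOME/.local"
```

Installing into a user-writable prefix avoids the need for root permissions; the compiler and linker then have to be pointed at that prefix when building PopIns2.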
The other software dependencies (bwa, samtools, sickle) must be accessible system-wide; otherwise, you have to write the full paths to the executables into the config file (see Installation).
The submodules and header libraries (GATB and SeqAn) come with the git clone by default, so there is no need for a manual installation.
For backward compatibility with the first major release of PopIns, PopIns2 can optionally use the Velvet assembler (see popins for installation recommendation).
Make sure that all binaries of the software dependencies are globally available on your system (e.g. by appending their directories to your PATH). Alternatively, you can specify the paths to the binaries within the popins2.config prior to executing make.
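Before running make, a quick check of the PATH can show which dependencies still need attention. The following is a minimal sketch; `check_tools` is a hypothetical helper written for this example and is not shipped with PopIns2.

```shell
# Sketch: report which external tools are missing from PATH.
# check_tools is a hypothetical helper, not part of PopIns2.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool" >&2
            missing=$((missing + 1))
        fi
    done
    # The exit status is the number of missing tools (0 means all were found).
    return "$missing"
}

# Example call; a non-zero status means at least one tool is not on PATH.
check_tools bwa samtools sickle ||
    echo "add the missing tools to PATH or set their paths in popins2.config"
```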
After the compilation with make you should see the binary popins2 in the root folder.
The PopIns2 Wiki collects known issues that might occur during installation or runtime.
git clone --recursive https://github.com/kehrlab/PopIns2.git
cd PopIns2
mkdir build
make
PopIns2 is a program consisting of several submodules. The submodules are designed to be executed one after another and fit together into a consecutive workflow. To display the help page of a submodule, type popins2 <command> --help as shown in the help section.
popins2 assemble [OPTIONS] sample.bam
The assemble command identifies reads without high-quality alignment to the reference genome, filters reads with poor base quality and assembles them into a set of contigs. The reads, given as a BAM file, must be indexed by bwa index. Optionally, reads can be remapped to an additional reference FASTA before the filtering and assembly such that only the remaining reads without a high-quality alignment are further processed (e.g. useful for decontamination). The additional reference FASTA must be indexed by bwa index too.
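Indexing the additional reference can be scripted so that it is done only once. The sketch below wraps `bwa index`; the helper name `index_reference` and the example file name are hypothetical, and the assemble option that accepts the additional reference is deliberately not shown here (see `popins2 assemble --help`).

```shell
# Sketch: index a reference FASTA with bwa unless an index already exists.
# index_reference is a hypothetical helper, not part of PopIns2.
index_reference() {
    fasta="$1"
    [ -f "$fasta" ] || { echo "no such file: $fasta" >&2; return 1; }
    # bwa index writes .amb, .ann, .bwt, .pac and .sa files next to the FASTA;
    # the .bwt file is used here as the marker for an existing index.
    [ -f "$fasta.bwt" ] || bwa index "$fasta"
}

# Example (hypothetical file name):
#   index_reference /path/to/additional_reference.fa
```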
popins2 merge [OPTIONS] {-s|-r} DIR
[Default] The merge command builds a colored and compacted de Bruijn Graph (ccdbg) of all contigs of all samples in a given source directory DIR.
By default, the merge module finds all files of the pattern <DIR>/*/assembly_final.contigs.fa. To process the contigs of the assemble command, the -r input parameter and the graph simplification options -d and -i are highly recommended. Once the ccdbg is built, the merge module identifies paths in the graph and returns supercontigs.
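Before running merge, it can be useful to see which samples actually contribute contigs. The sketch below mirrors the default file pattern with a shell glob; `list_contig_files` is a hypothetical helper, and the demonstration directory with its mock samples exists only for this example.

```shell
# Sketch: list the contig files that match <DIR>/*/assembly_final.contigs.fa.
# list_contig_files is a hypothetical helper, not part of PopIns2.
list_contig_files() {
    dir="$1"
    for f in "$dir"/*/assembly_final.contigs.fa; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}

# Demonstration on a throwaway directory with three mock samples,
# of which only two have an assembly.
demo=$(mktemp -d)
mkdir -p "$demo/sample1" "$demo/sample2" "$demo/sample3"
touch "$demo/sample1/assembly_final.contigs.fa" \
      "$demo/sample2/assembly_final.contigs.fa"
list_contig_files "$demo"
```

In the demonstration, sample3 has no assembly file and is therefore not listed, which is how a sample with a failed or missing assemble run would show up.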
popins2 merge [OPTIONS] -y GFA -z BFG_COLORS
An alternative way of providing input for the merge command is to directly pass a ccdbg. Here, the merge command expects a GFA file and a bfg_colors file, which is specific to Bifrost. If you choose to run the merge command with a pre-built GFA graph, note that you have to set the Algorithm options accordingly (in particular -k).
popins2 contigmap [OPTIONS] SAMPLE_ID
The contigmap command maps all reads with low-quality alignments of a sample to the set of supercontigs using BWA-MEM. The mapping information is then merged with the reads' mates.
popins2 place-refalign [OPTIONS]
popins2 place-splitalign [OPTIONS] SAMPLE_ID
popins2 place-finish [OPTIONS]
In brief, the place commands attempt to anchor the supercontigs to the samples. First, all potential anchor locations from all samples are collected. Then prefixes/suffixes of the supercontigs are aligned to all collected locations. For successful alignments, records are written to a VCF file. In the second step, all remaining locations are split-aligned per sample. Finally, all locations from all successful split-alignments are combined and added to the VCF file.
popins2 genotype [OPTIONS] SAMPLE_ID
The genotype command generates alleles (ALT) of the supercontigs with some flanking reference genome sequence. Then, the reads of a sample are aligned to ALT and the reference genome around the breakpoint (REF). The ratio of alignments to ALT and REF determines a genotype quality and a final genotype prediction per variant per sample.
Test data for a minimum working example can be found at Zenodo. A simple project structure for PopIns2 looks like
$ tree /path/to/your/project/
/path/to/your/project/
├── myFirstSample
│   ├── first_sample.bam
│   └── first_sample.bam.bai
├── mySecondSample
│   ├── second_sample.bam
│   └── second_sample.bam.bai
└── myThirdSample
    ├── third_sample.bam
    └── third_sample.bam.bai
and a simple workflow could look like
cd /path/to/your/project
ln -s /path/to/reference_genome.fa genome.fa
ln -s /path/to/reference_genome.fa.fai genome.fa.fai
popins2 assemble --sample sample1 /path/to/your/project/myFirstSample/first_sample.bam
popins2 assemble --sample sample2 /path/to/your/project/mySecondSample/second_sample.bam
popins2 assemble --sample sample3 /path/to/your/project/myThirdSample/third_sample.bam
popins2 merge -r /path/to/your/project -di
popins2 contigmap sample1
popins2 contigmap sample2
popins2 contigmap sample3
popins2 place-refalign
popins2 place-splitalign sample1
popins2 place-splitalign sample2
popins2 place-splitalign sample3
popins2 place-finish
popins2 genotype sample1
popins2 genotype sample2
popins2 genotype sample3
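With more than a handful of samples, the same sequence of commands can be written as loops. The following is a sketch only: `run`, `DRY_RUN` and `popins2_workflow` are hypothetical names introduced for this example, and it assumes that the current directory is the project directory with the genome.fa links already in place. By default it only prints the commands; setting `DRY_RUN=0` executes them.

```shell
# Sketch: the workflow above as loops over sample IDs.
# run, DRY_RUN and popins2_workflow are hypothetical names for illustration.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # dry run: only print the command
    else
        "$@"
    fi
}

popins2_workflow() {
    project="$1"; shift
    # Remaining arguments are "sampleID:/path/to/sample.bam" pairs.
    for pair in "$@"; do
        run popins2 assemble --sample "${pair%%:*}" "${pair#*:}" || return 1
    done
    run popins2 merge -r "$project" -di || return 1
    for pair in "$@"; do run popins2 contigmap "${pair%%:*}" || return 1; done
    run popins2 place-refalign || return 1
    for pair in "$@"; do run popins2 place-splitalign "${pair%%:*}" || return 1; done
    run popins2 place-finish || return 1
    for pair in "$@"; do run popins2 genotype "${pair%%:*}" || return 1; done
}

# Dry run (the default here) prints the 15 commands without executing them.
popins2_workflow /path/to/your/project \
    sample1:/path/to/your/project/myFirstSample/first_sample.bam \
    sample2:/path/to/your/project/mySecondSample/second_sample.bam \
    sample3:/path/to/your/project/myThirdSample/third_sample.bam
```

The loops keep the order of the steps intact: merge, place-refalign and place-finish run once across all samples, while the remaining commands run once per sample.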
The workflow of PopIns2 can be effectively distributed across an HPC cluster environment. This GitHub project provides a template of a full PopIns2 workflow as individual cluster jobs using Snakemake, a Python-based workflow management tool.
$ popins2 -h
Population-scale detection of non-reference sequence insertions using colored de Bruijn Graphs
================================================================
SYNOPSIS
    popins2 COMMAND [OPTIONS]
COMMAND
    assemble            Filter, clip and assemble unmapped reads from a sample.
    merge               Generate supercontigs from a colored compacted de Bruijn Graph.
    multik              Multi-k framework for a colored compacted de Bruijn Graph.
    contigmap           Map unmapped reads to (super-)contigs.
    place-refalign      Find position of (super-)contigs by aligning contig ends to the reference genome.
    place-splitalign    Find position of (super-)contigs by split-read alignment (per sample).
    place-finish        Combine position found by split-read alignment from all samples.
    genotype            Determine genotypes of all insertions in a sample.
VERSION
    0.12.0-a935f00, Date: on 2020-10-21 12:50:29
Try `popins2 COMMAND --help' for more information on each command.
For further troubleshooting, FAQs and tips on using PopIns2, please have a look at the PopIns2 Wiki.
Thomas Krannich, W Timothy J White, Sebastian Niehus, Guillaume Holley, Bjarni V Halldórsson, Birte Kehr. Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics 2022, 38(3):604–611. https://doi.org/10.1093/bioinformatics/btab749