PopIns2

Population-scale detection of non-reference sequence variants using colored de Bruijn Graphs

Contents:

  1. Requirements
  2. Installation
  3. Usage
  4. Example
  5. Snakemake
  6. Help
  7. Citation

Requirements:

Requirement                              | Tested with
64-bit POSIX-compliant operating system  | Ubuntu 16.04 / 18.04, CentOS Linux 7.6
C++14 capable compiler                   | g++ vers. 4.9.2, 5.5.0, 7.2.0
Bifrost                                  | vers. 1.0.4-ab43065
bwa                                      | vers. 0.7.15-r1140
samtools                                 | vers. 1.3, 1.5
sickle                                   | vers. 1.33
gatb-minia-pipeline                      | (submodule; no need to install)
SeqAn                                    | (header library; no need to install)

PopIns2 requires an installation of the Bifrost C++ library. Bifrost must be compiled from source with the MAX_KMER_SIZE=64 setting. Do not use the conda package of Bifrost, as it does not meet the requirements for larger k-mer sizes and performance optimizations.
⚠️ Important: With a release on April 28, 2022, the Bifrost API underwent major implementation changes that are not backward compatible. Later releases have not been tested thoroughly and might violate the objective of PopIns2. For the time being, please use a Bifrost version prior to commit 703be6d. The GitHub Actions build workflow shows one way to reset Bifrost to a tested release.
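As a rough illustration of the Bifrost requirement, the sketch below prints the commands for a source build with 64-mer support. Several details are assumptions rather than part of the PopIns2 documentation: the upstream repository URL, the reading of the suffix ab43065 from the "Tested with" column as a git revision, and the cmake flag form, which follows Bifrost's own build instructions. The run helper and the function name are hypothetical. Nothing is executed unless DRY_RUN=0 is set, and the install step is left out.

```shell
#!/bin/sh
# Sketch: build Bifrost from source at a tested revision with MAX_KMER_SIZE=64.
# Assumed, not taken from the PopIns2 docs: the upstream URL and that
# "ab43065" (from the requirements table) can be used as a git revision.
BIFROST_URL="https://github.com/pmelsted/bifrost.git"
BIFROST_REV="ab43065"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: clone, pin the revision, configure and compile.
build_bifrost() {
    run git clone "$BIFROST_URL" bifrost &&
    run cd bifrost &&
    run git checkout "$BIFROST_REV" &&
    run mkdir -p build &&
    run cd build &&
    run cmake .. -DMAX_KMER_SIZE=64 &&
    run make
}

build_bifrost
```

Review the printed commands first, then rerun with DRY_RUN=0 in a scratch directory if they match your setup.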
The other software dependencies (bwa, samtools, sickle) must be accessible system-wide; otherwise, write the full paths to the executables into the config file (see Installation). The submodules and header libraries (GATB and SeqAn) come with the git clone by default, so no manual installation is needed. For backward compatibility with the first major release of PopIns, PopIns2 can optionally use the Velvet assembler (see popins for installation recommendations).
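Before compiling, it can help to check which of the external tools are reachable through PATH. The following is a minimal sketch; the function name missing_tools is hypothetical, and only the tool names are taken from the requirements above.

```shell
#!/bin/sh
# Sketch: report which external tools cannot be found via PATH.
# missing_tools is a hypothetical helper; it prints one missing tool per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools bwa samtools sickle)"
if [ -n "$missing" ]; then
    echo "Not found on PATH (full paths can be set in the config file instead):"
    echo "$missing"
else
    echo "bwa, samtools and sickle are all on PATH."
fi
```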

Installation:

Make sure that all binaries of the software dependencies are globally available on your system (e.g. by appending their directories to your PATH). Alternatively, you can specify the paths to the binaries in popins2.config before executing make. After compiling with make, you should see the binary popins2 in the root folder. The PopIns2 Wiki collects known issues that might occur during installation or runtime.

git clone --recursive https://github.com/kehrlab/PopIns2.git
cd PopIns2
mkdir build
make

Usage:

PopIns2 is a program consisting of several submodules. The submodules are designed to be executed one after another and fit together into a consecutive workflow. To display the help page of a submodule type popins2 <command> --help as shown in the help section.

The assemble command

popins2 assemble [OPTIONS] sample.bam

The assemble command identifies reads without high-quality alignment to the reference genome, filters reads with poor base quality and assembles the remaining reads into a set of contigs. The reads are given as a BAM file, which must be indexed (a .bam.bai file next to the BAM, as in the example project structure, e.g. created with samtools index). Optionally, reads can be remapped to an additional reference FASTA before filtering and assembly, such that only the reads that remain without a high-quality alignment are processed further (e.g. useful for decontamination). The additional reference FASTA must be indexed with bwa index.
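For the optional additional reference, a quick check for an existing bwa index can avoid a failed remapping step. The sketch below relies on the fact that bwa index writes files with the extensions .amb, .ann, .bwt, .pac and .sa next to the FASTA; the function name has_bwa_index and the file name contaminants.fa are hypothetical.

```shell
#!/bin/sh
# Sketch: check whether a FASTA has a complete bwa index next to it.
# has_bwa_index is a hypothetical helper; contaminants.fa is a placeholder name.
has_bwa_index() {
    for ext in amb ann bwt pac sa; do
        [ -f "$1.$ext" ] || return 1
    done
    return 0
}

fasta="${1:-contaminants.fa}"
if has_bwa_index "$fasta"; then
    echo "bwa index found for $fasta"
else
    echo "no complete bwa index for $fasta; create one with: bwa index $fasta"
fi
```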

The merge command

popins2 merge [OPTIONS] {-s|-r} DIR

[Default] The merge command builds a colored and compacted de Bruijn Graph (ccdbg) of all contigs of all samples in a given source directory DIR. By default, the merge module finds all files matching the pattern <DIR>/*/assembly_final.contigs.fa. To process the contigs of the assemble command, the -r input parameter and the graph simplification options -d and -i are highly recommended. Once the ccdbg is built, the merge module identifies paths in the graph and returns supercontigs.
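To preview which contig files match the default pattern before running merge, a small shell check like the following may be useful. It only mirrors the glob <DIR>/*/assembly_final.contigs.fa; the function name list_contig_files is hypothetical and not part of PopIns2.

```shell
#!/bin/sh
# Sketch: list the files matching <DIR>/*/assembly_final.contigs.fa.
# list_contig_files is a hypothetical helper, not a PopIns2 command.
list_contig_files() {
    for f in "$1"/*/assembly_final.contigs.fa; do
        # An unmatched glob stays literal, so keep only files that exist.
        [ -f "$f" ] && echo "$f"
    done
    return 0
}

dir="${1:-.}"
count="$(list_contig_files "$dir" | wc -l | tr -d ' ')"
echo "$count sample(s) with assembled contigs under $dir"
list_contig_files "$dir"
```

A sample directory without assembly_final.contigs.fa does not appear in the list, which makes an incomplete assemble run easy to spot.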

popins2 merge [OPTIONS] -y GFA -z BFG_COLORS

An alternative way of providing input for the merge command is to pass a ccdbg directly. Here, the merge command expects a GFA file and a bfg_colors file, which is specific to Bifrost. If you choose to run the merge command with a pre-built GFA graph, note that you have to set the Algorithm options accordingly (in particular -k).

The contigmap command

popins2 contigmap [OPTIONS] SAMPLE_ID

The contigmap command maps all reads with low-quality alignments of a sample to the set of supercontigs using BWA-MEM. The mapping information is then merged with the reads' mates.

The place commands

popins2 place-refalign [OPTIONS]
popins2 place-splitalign [OPTIONS] SAMPLE_ID
popins2 place-finish [OPTIONS]

In brief, the place commands attempt to anchor the supercontigs to the samples. First, all potential anchor locations from all samples are collected. Then prefixes/suffixes of the supercontigs are aligned to all collected locations. For successful alignments, records are written to a VCF file. In the second step, all remaining locations are split-aligned per sample. Finally, all locations from all successful split-alignments are combined and added to the VCF file.

The genotype command

popins2 genotype [OPTIONS] SAMPLE_ID

The genotype command generates alleles (ALT) of the supercontigs with some flanking reference genome sequence. Then, the reads of a sample are aligned to ALT and the reference genome around the breakpoint (REF). The ratio of alignments to ALT and REF determines a genotype quality and a final genotype prediction per variant per sample.

Example:

Test data for a minimum working example can be found at Zenodo. A simple project structure for PopIns2 looks like

$ tree /path/to/your/project/
/path/to/your/project/
├── myFirstSample
│   ├── first_sample.bam
│   └── first_sample.bam.bai
├── mySecondSample
│   ├── second_sample.bam
│   └── second_sample.bam.bai
└── myThirdSample
    ├── third_sample.bam
    └── third_sample.bam.bai

and a simple workflow could look like

cd /path/to/your/project
ln -s /path/to/reference_genome.fa genome.fa
ln -s /path/to/reference_genome.fa.fai genome.fa.fai

popins2 assemble --sample sample1 /path/to/your/project/myFirstSample/first_sample.bam
popins2 assemble --sample sample2 /path/to/your/project/mySecondSample/second_sample.bam
popins2 assemble --sample sample3 /path/to/your/project/myThirdSample/third_sample.bam

popins2 merge -r /path/to/your/project -di

popins2 contigmap sample1
popins2 contigmap sample2
popins2 contigmap sample3

popins2 place-refalign
popins2 place-splitalign sample1
popins2 place-splitalign sample2
popins2 place-splitalign sample3
popins2 place-finish

popins2 genotype sample1
popins2 genotype sample2
popins2 genotype sample3
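With more samples, the per-sample steps above are easier to maintain as loops. The following sketch prints the same popins2 calls as the example, in the same order. The run helper, the variable names and the bam_for lookup are hypothetical. Commands are only printed unless DRY_RUN=0 is set, and the cd and ln -s setup steps from the example are assumed to have been done beforehand.

```shell
#!/bin/sh
# Sketch: the example workflow as loops over sample IDs.
# Only the popins2 calls mirror the example; helper names are hypothetical.
PROJECT="/path/to/your/project"
SAMPLES="sample1 sample2 sample3"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Map a sample ID to its BAM file (paths taken from the example layout).
bam_for() {
    case "$1" in
        sample1) echo "$PROJECT/myFirstSample/first_sample.bam" ;;
        sample2) echo "$PROJECT/mySecondSample/second_sample.bam" ;;
        sample3) echo "$PROJECT/myThirdSample/third_sample.bam" ;;
    esac
}

workflow() {
    for id in $SAMPLES; do
        run popins2 assemble --sample "$id" "$(bam_for "$id")" || return 1
    done
    run popins2 merge -r "$PROJECT" -di || return 1
    for id in $SAMPLES; do
        run popins2 contigmap "$id" || return 1
    done
    # place-refalign runs once; place-splitalign runs per sample and must
    # finish for every sample before place-finish combines the results.
    run popins2 place-refalign || return 1
    for id in $SAMPLES; do
        run popins2 place-splitalign "$id" || return 1
    done
    run popins2 place-finish || return 1
    for id in $SAMPLES; do
        run popins2 genotype "$id" || return 1
    done
}

workflow
```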

Snakemake:

The PopIns2 workflow can be distributed effectively across an HPC cluster environment. This GitHub project provides a template of a full PopIns2 workflow as individual cluster jobs using Snakemake, a Python-based workflow management tool.

Help:

$ popins2 -h

Population-scale detection of non-reference sequence insertions using colored de Bruijn Graphs
================================================================

SYNOPSIS
    popins2 COMMAND [OPTIONS]

COMMAND
    assemble            Filter, clip and assemble unmapped reads from a sample.
    merge               Generate supercontigs from a colored compacted de Bruijn Graph.
    multik              Multi-k framework for a colored compacted de Bruijn Graph.
    contigmap           Map unmapped reads to (super-)contigs.
    place-refalign      Find position of (super-)contigs by aligning contig ends to the reference genome.
    place-splitalign    Find position of (super-)contigs by split-read alignment (per sample).
    place-finish        Combine position found by split-read alignment from all samples.
    genotype            Determine genotypes of all insertions in a sample.

VERSION
    0.12.0-a935f00, Date: on 2020-10-21 12:50:29

Try `popins2 COMMAND --help' for more information on each command.

For troubleshooting, FAQs and usage tips, see the PopIns2 Wiki.

Citation:

Thomas Krannich, W Timothy J White, Sebastian Niehus, Guillaume Holley, Bjarni V Halldórsson, Birte Kehr. Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics 2022, 38(3):604–611. https://doi.org/10.1093/bioinformatics/btab749
