Graph realignment tools for structural variants
Latest commit bab03b8 May 14, 2019

Paragraph: a suite of graph-based genotyping tools

Introduction

Accurate genotyping of known variants is critical for the analysis of whole-genome sequencing data.

Paragraph aims to facilitate these tasks by providing:

  • an accurate genotyper for Structural Variations in short-read data
  • a suite of graph-based tools to align and genotype complex events

Please reference Paragraph using:

The variant calls described in the paper are available; see data/download-instructions.txt.

System Requirements

Hardware

A standard workstation with at least 8GB of RAM should be sufficient for compilation and testing of the program.

It typically takes up to a few seconds to genotype a single event in one sample (single-threaded). We provide wrapper scripts to parallelize this process. It took us 30 minutes to genotype ~20,000 SVs using 20 CPU cores (with I/O).
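The throughput figures above can be turned into a rough planning estimate. The sketch below is only back-of-the-envelope arithmetic: it assumes work scales linearly with the number of cores and events, which will not hold exactly once I/O dominates. The function names are illustrative and are not part of Paragraph.

```python
def core_seconds_per_event(wall_minutes, n_cores, n_events):
    """Average CPU time (in core-seconds) spent per genotyped event."""
    return wall_minutes * 60 * n_cores / n_events


def estimated_wall_minutes(n_events, n_cores, per_event_core_seconds):
    """Wall-clock estimate in minutes, assuming perfect scaling across cores."""
    return n_events * per_event_core_seconds / n_cores / 60


# Figures quoted above: 30 minutes, 20 cores, roughly 20,000 SVs.
rate = core_seconds_per_event(wall_minutes=30, n_cores=20, n_events=20000)
print("core-seconds per SV:", rate)

# Hypothetical planning question: 50,000 SVs on 8 cores at the same rate.
print("estimated minutes:", estimated_wall_minutes(50000, 8, rate))
```

The resulting rate of just under two core-seconds per event is consistent with the statement that a single event takes up to a few seconds.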

Operating systems

Paragraph is supported on the following systems:

  • Ubuntu 16.04 and CentOS 5-7
  • macOS 10.11+

Python 3.4+ is required.

We recommend using g++ (6.0+), or a recent version of Clang.

We use the C++11 standard; any POSIX-compliant compiler supporting this standard should be usable.

Third-party libraries

The following Python modules are required:

  • Pysam
  • Intervaltree
  • Jsonschema

Boost libraries (version >= 1.5) are required.

  • We prefer to statically link Boost libraries to Paragraph executables:

    cd ~
    wget http://downloads.sourceforge.net/project/boost/boost/1.65.0/boost_1_65_0.tar.bz2
    tar xf boost_1_65_0.tar.bz2
    cd boost_1_65_0
    ./bootstrap.sh
    ./b2 --prefix=$HOME/boost_1_65_0_install link=static install
  • To point CMake to your version of Boost, use the BOOST_ROOT environment variable:

    export BOOST_ROOT=$HOME/boost_1_65_0_install
    # Now run cmake + build as shown below.

We have included copies of other dependent libraries in external/. They are:

  • Google Test and Google Mock (v1.8.0)
  • Htslib (v1.9)
  • Spdlog

Installation

Native build

First, check out the repository into a folder named paragraph-tools:

git clone https://github.com/Illumina/paragraph.git paragraph-tools
cd paragraph-tools

Then create a new directory for the program and compile it there:

# Create a separate build folder.
cd ..
mkdir paragraph-tools-build
cd paragraph-tools-build

# Configure
# optional:
# export BOOST_ROOT=<path-to-boost-installation>
cmake ../paragraph-tools

# Make, use -j <n> to use n parallel jobs to build, e.g. make -j4
make

From Docker Image

We also provide a Dockerfile. To build a Docker image, run the following command inside the source checkout folder:

docker build .

Once the image is built you can find out its ID like this:

docker images
REPOSITORY                             TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
<none>                                 <none>              259aa8c0c920        10 minutes ago      2.18 GB

See the sections below for how to run Paragraph. Before running it, start a container from the image with the current directory mounted:

sudo docker run -v `pwd`:/data 259aa8c0c920

The current directory can be accessed as /data inside the Docker container. See also https://docs.docker.com/engine/reference/commandline/run/.

To override the default entrypoint and get an interactive shell in which the Paragraph tools can be run, use the following command:

sudo docker run --entrypoint /bin/bash -it 259aa8c0c920

Run Paragraph from VCF

Example

After installation, run the multigrmpy.py script in the bin directory of your build folder on an example dataset as follows:

python3 bin/multigrmpy.py -i share/test-data/round-trip-genotyping/candidates.vcf \
                          -m share/test-data/round-trip-genotyping/samples.txt \
                          -r share/test-data/round-trip-genotyping/dummy.fa \
                          -o test

This runs a simple genotyping example for two test samples.

  • candidates.vcf: candidate SV events in VCF format.
  • samples.txt: a tab-delimited manifest that lists the test BAM files.
  • dummy.fa: a short dummy reference that contains only chr1.

The output folder test then contains gzipped JSON and VCF files with the final genotypes:

$ tree test
test
├── grmpy.log            #  main workflow log file
├── genotypes.vcf.gz     #  Output VCF with individual genotypes
├── genotypes.json.gz    #  More detailed output than genotypes.vcf.gz
├── variants.vcf.gz      #  The input VCF with unique ID from Paragraph
└── variants.json.gz     #  The converted graphs from input VCF (no genotypes)

If successful, the last 3 lines of genotypes.vcf.gz will be the same as in the expected file.
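The comparison of the final lines can be scripted. The following is a minimal sketch using only the Python standard library; the helper names are illustrative, and the location of the expected file is not given here, so its path is left as a parameter for you to supply.

```python
import gzip


def last_lines(path, n=3):
    """Return the last n lines of a text file, transparently handling .gz."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle]
    return lines[-n:]


def tails_match(observed_path, expected_path, n=3):
    """True if the last n lines of both files are identical."""
    return last_lines(observed_path, n) == last_lines(expected_path, n)
```

For the example run, this could be called as `tails_match("test/genotypes.vcf.gz", expected_path)`, where `expected_path` points at the expected output shipped with the test data.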

Input requirements

VCF format

Paragraph will independently genotype each entry of the input VCF. You can use either indel-style representation (full REF and ALT allele sequence in 4th and 5th columns) or symbolic alleles, as long as they meet the format requirement of VCF 4.0+.

Currently we support 4 symbolic alleles:

  • <DEL> for deletion
    • Must have END key in INFO field.
  • <INS> for insertion
    • Must have a key in INFO field for insertion sequence (without padding base). The default key is SEQ.
    • For blockwise swaps, we strongly recommend using the indel-style representation rather than symbolic alleles.
  • <DUP> for duplication
    • Must have END key in INFO field. Paragraph assumes that the sequence between POS and END is duplicated one more time in the alternative allele.
  • <INV> for inversion
    • Must have END key in INFO field. Paragraph assumes that the sequence between POS and END is reverse-complemented in the alternative allele.
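The INFO requirements listed above can be checked before submitting a VCF. The sketch below is a plain-Python pre-flight check, not Paragraph's own validation: it assumes standard tab-separated VCF data lines, and the function names and the `ins_seq_key` parameter are hypothetical helpers introduced for illustration.

```python
# Required INFO key for each supported symbolic allele, as listed above.
REQUIRED_INFO_KEY = {"<DEL>": "END", "<INS>": "SEQ", "<DUP>": "END", "<INV>": "END"}


def parse_info(info_field):
    """Turn 'END=200;SVTYPE=DEL;IMPRECISE' into a dict (flags map to True)."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def missing_info_key(vcf_line, ins_seq_key="SEQ"):
    """Return the INFO key a symbolic-allele record lacks, or None if it is fine.

    Indel-style records (explicit REF/ALT sequences) always return None.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    alt, info = fields[4], parse_info(fields[7])
    for allele in alt.split(","):
        if allele == "<INS>":
            required = ins_seq_key  # default key is SEQ, but it may differ
        else:
            required = REQUIRED_INFO_KEY.get(allele)
        if required is not None and required not in info:
            return required
    return None


record = "chr1\t100\t.\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL"
print(missing_info_key(record))
```

Running this on the example record reports that the END key is absent, which is the condition a `<DEL>` entry must not be in.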

Sample Manifest

Must be tab-delimited.

Required columns:

  • ID: Each sample must have a unique ID. The output VCF will include genotypes for all samples in the manifest.
  • path: Path to the BAM/CRAM file.
  • depth: Average depth across the genome. Can be calculated with bin/idxdepth or samtools.
  • read length: Average read length (bp) across the genome.

Optional columns:

  • depth sd: Specify standard deviation for genome depth. Used for the normal test of breakpoint read depth. Default is sqrt(5*depth).

  • depth variance: Square of depth sd.

  • sex: Affects chrX and chrY genotyping. Allowed values are "male" or "M", "female" or "F", and "unknown" (the quotes should not be included in the manifest). If not specified, the sample will be treated as unknown.
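A manifest with the required columns can be generated programmatically. The following is a minimal sketch under stated assumptions: the exact header spelling (lowercase `id` here) should be checked against the shipped samples.txt, and the sample dictionary keys and function names are hypothetical. It also shows the default depth standard deviation, sqrt(5*depth), described above.

```python
import csv
import math
import sys


def default_depth_sd(depth):
    """Default standard deviation used when 'depth sd' is not given."""
    return math.sqrt(5 * depth)


def write_manifest(samples, handle):
    """Write a tab-delimited manifest with the four required columns.

    `samples` is a list of dicts with keys id, path, depth, read_length
    (key names chosen for this sketch only).
    """
    seen = set()
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header spelling is an assumption; compare with the example manifest.
    writer.writerow(["id", "path", "depth", "read length"])
    for sample in samples:
        if sample["id"] in seen:
            raise ValueError("duplicate sample ID: " + sample["id"])
        seen.add(sample["id"])
        writer.writerow(
            [sample["id"], sample["path"], sample["depth"], sample["read_length"]]
        )


example = [{"id": "sample1", "path": "/data/sample1.bam", "depth": 30, "read_length": 150}]
write_manifest(example, sys.stdout)
```

The uniqueness check mirrors the requirement that every sample has its own ID; the depth and read length values in the example are placeholders, not measurements.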

Run Paragraph on complex variants

For more complicated events (e.g. genotyping a deletion together with its nearby SNP), you can provide a customized JSON to Paragraph.

Please follow the pattern in the example JSON and make sure all required keys are provided. Here is a visualization of this sample graph.

To obtain graph alignments for this graph (including all reads), run:

bin/paragraph -b <input BAM> \
              -r <reference fasta> \
              -g <input graph JSON> \
              -o <output JSON path> \
              -E 1

To obtain the alignment summary, genotypes of each breakpoint, and the whole graph, run:

bin/grmpy -m <input manifest> \
          -r <reference fasta> \
          -i <input graph JSON> \
          -o <output JSON path> \
          -E 1
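When scripting these two steps, it can help to assemble the argument lists in one place. The sketch below only builds the command lines using the flags shown above; the function names and the file names in the example are hypothetical placeholders, and nothing is executed.

```python
import shlex


def paragraph_cmd(bam, reference, graph_json, output_json, binary="bin/paragraph"):
    """Argument list for graph alignment, mirroring the flags shown above."""
    return [binary, "-b", bam, "-r", reference,
            "-g", graph_json, "-o", output_json, "-E", "1"]


def grmpy_cmd(manifest, reference, graph_json, output_json, binary="bin/grmpy"):
    """Argument list for genotyping, mirroring the flags shown above."""
    return [binary, "-m", manifest, "-r", reference,
            "-i", graph_json, "-o", output_json, "-E", "1"]


def as_shell(cmd):
    """Render an argument list as a safely quoted shell command string."""
    return " ".join(shlex.quote(arg) for arg in cmd)


# Placeholder file names, for illustration only.
print(as_shell(paragraph_cmd("sample.bam", "ref.fa", "graph.json", "aligned.json")))
print(as_shell(grmpy_cmd("samples.txt", "ref.fa", "graph.json", "genotypes.json")))
```

Either list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files; keeping the arguments as a list avoids shell-quoting problems with paths that contain spaces.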

If you have multiple events listed in the input JSON, multigrmpy.py can help you to run multiple grmpy jobs together.

Further Information

Documentation

External links

  • The Illumina/Polaris repository provides the short-read sequencing data we used to test our method in a population.

License

The LICENSE file contains information about libraries and other tools we use, and license information for these.

Paragraph itself is distributed under the simplified BSD license. The full license text can be found at: https://github.com/Illumina/licenses/blob/master/Simplified-BSD-License.txt
