GRASPER: Genome Rearrangement Analysis using Short Paired-End Reads

GRASPER
Heewook Lee
heewlee@indiana.edu

--------------------------
         SUMMARY
--------------------------

GRASPER (Genome Rearrangement Analysis using Short Paired-End Reads) is a de novo structural variation (SV) caller that is capable of detecting repetitive SVs.

It uses RepGraph (a BLAST to A-Bruijn graph program) to construct A-Bruijn graphs of a given reference genome to capture approximate repeats (e.g. 95% sequence similarity or higher); SVs are then detected on the graphs.

GRASPER requires a reference genome sequence in a FASTA formatted file along with Illumina paired-end sequencing data of a sample genome.

Currently, it supports the following SV events:

1) Duplicative transposition
2) Deletion of non-repetitive region
3) Deletion of repetitive region
4) Deletion of non-repetitive region bounded by repeats (via homologous recombination)
5) Inversion
6) Tandem-duplication

Unsupported events are still reported in the form of breakpoints. GRASPER first calls breakpoints, then assigns SV events based on the well-known paired SV signatures along with read-depth information. Any breakpoint event without an SV event assignment is reported separately.


--------------------------
      Requirements
--------------------------
To build and run GRASPER, the following are required:

- JDK 1.6 or higher

- Unix-like OS (Linux, Mac OS X, ... )

- Legacy BLAST (available from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ ; more information at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download ). We used version 2.2.25, which can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.25/

- Burrows-Wheeler Aligner (BWA) by Heng Li (version 0.7.9 or higher)

- BLAST to A-Bruijn graph package (available from https://github.com/COL-IU/RepGraph )

- Illumina or Illumina-like paired-end reads (whole-genome sequencing)

- a reference genome sequence

- As of v0.1.1, the .medMAD file is generated AUTOMATICALLY by RepGraph (v0.1.1). This file contains the median and Median Absolute Deviation (MAD) of the library insert size, as a single line of 2 values delimited by a tab. Under a normal distribution, 1 SD ~ 1.4826 MAD (https://en.wikipedia.org/wiki/Median_absolute_deviation).
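
As an illustration of the .medMAD format described above, here is a minimal Python sketch that reads the single tab-delimited line and converts the MAD into an approximate standard deviation. The column order (median first, then MAD), the function names, and the example values are assumptions made for illustration; they are not taken from RepGraph.

```python
from typing import Tuple

# Scale factor relating MAD to SD for normally distributed data.
MAD_TO_SD = 1.4826


def parse_medmad_line(line: str) -> Tuple[float, float]:
    """Parse one 'median<TAB>MAD' line (column order is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-delimited values, got {len(fields)}")
    median, mad = (float(value) for value in fields)
    return median, mad


def mad_to_sd(mad: float) -> float:
    """Approximate the standard deviation from the MAD (normality assumed)."""
    return MAD_TO_SD * mad


# Made-up example values, not taken from the test data.
median, mad = parse_medmad_line("470\t20\n")
print(median, mad, mad_to_sd(mad))
```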

-------------------------
      Installation
-------------------------

After downloading the GRASPER source distribution and unpacking it, change into the top-level directory:

> cd grasper


Then compile and create the .jar file:

> make
 

This will create a new directory "bin" under the grasper directory with the following jar file:

grasper.jar


-------------------------
       Config file
-------------------------
The configuration file contains the parameters that GRASPER/RepGraph/BLAST/bwa need.

An example configuration file can be found in the "test_data" directory.


-------------------------
      How to run
-------------------------

Although grasper can run as a standalone program, it first needs an A-Bruijn graph representation of the reference genome, which is generated by the RepGraph package, as well as a SAM formatted alignment of the paired-end reads. For this reason, grasper.sh is provided to tie all these dependencies together in a single script.

Here is the list of commands for running on test_data:

1. Move into the test_data directory under the GRASPER directory
> cd <GRASPER_INSTALLATION_DIR>/test_data

2. Indexing for BLAST and bwa (ONLY needs to be run once for a reference genome)
> ../grasper.sh I example_config.txt

3. Run pair-wise BLASTN on a given reference genome and construct A-Bruijn graphs (ONLY needs to be run once for a reference genome)
> ../grasper.sh G example_config.txt

4. Align via BWA
> ../grasper.sh A example_config.txt 20Insertions_per_element_1TH_pIRS_20X_11_90_470_1.fq.gz 20Insertions_per_element_1TH_pIRS_20X_11_90_470_2.fq.gz

5. Depth serialization, mid-sorting, discordant pair removal, SV detection
> ../grasper.sh DS example_config.txt

Note that the A, D, and S commands can be run separately or combined into one (ADS). Run grasper.sh without any parameters to see more explanation.
> ../grasper.sh

A screen dump of a run on test_data can be found in test_data/test_data.screendump
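
The steps above can be collected into a small driver. The following Python sketch only builds and prints the four grasper.sh command lines in order; it does not run them. The function name and the read file names (sample_1.fq.gz, sample_2.fq.gz) are hypothetical placeholders.

```python
import shlex
from typing import List


def grasper_commands(script: str, config: str,
                     reads1: str, reads2: str) -> List[List[str]]:
    """Return the grasper.sh invocations from the steps above, in order."""
    return [
        [script, "I", config],                  # index for BLAST and bwa (once per reference)
        [script, "G", config],                  # pairwise BLASTN + A-Bruijn graphs (once per reference)
        [script, "A", config, reads1, reads2],  # align the paired-end reads via BWA
        [script, "DS", config],                 # depth serialization through SV detection
    ]


# Hypothetical read file names; substitute your own paired FASTQ files.
commands = grasper_commands("../grasper.sh", "example_config.txt",
                            "sample_1.fq.gz", "sample_2.fq.gz")
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually execute the steps, each list could be passed to subprocess.run(cmd, check=True) from within the test_data directory, so that a failing step stops the pipeline.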

------------------------
        OUTPUT
------------------------
*.thread : A-Bruijn graph threading information

*.depth : serialization of the depth arrays

*.discordant.midsorted : midpoint-sorted SAM file containing only the discordant mappings

*.SV : SV calls from GRASPER

-----------------------
       .SV file
-----------------------
Two-breakpoint events (TRANSPOSITION or INVERSION) have 23 columns; one-breakpoint events have only the first 13 columns.
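
The column counts above give a simple way to tell the two kinds of rows apart. This is a minimal Python sketch, assuming the .SV file is tab-delimited (this README does not state the delimiter); the function names are illustrative only.

```python
from collections import Counter
from typing import Iterable, Optional


def breakpoint_count(line: str) -> Optional[int]:
    """Return 2 for a 23-column row, 1 for a 13-column row, else None."""
    # Assumption: fields are separated by tabs.
    n_fields = len(line.rstrip("\n").split("\t"))
    return {23: 2, 13: 1}.get(n_fields)


def tally_events(lines: Iterable[str]) -> Counter:
    """Count rows by number of breakpoints; other rows are tallied under None."""
    return Counter(breakpoint_count(line) for line in lines)


# Placeholder rows standing in for real .SV content.
rows = ["\t".join(["x"] * 23), "\t".join(["x"] * 13), "not an event row"]
print(tally_events(rows))
```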

*** COLUMNS ***
Column 1 : Event  ( (I) means inverted )
Column 2 : event classifier (internal purpose)

Columns 3/5/20/22 : number of reads in each cluster
Columns 4/6/21/23 : number of places each cluster can map on the linear reference. Clusters on repetitive paths of the graph have values > 1, indicating their multiplicities.

Columns ( 7-8-9 / 10-11-12 / 14-15-16 / 17-18-19 ) : each triplet gives 5'boundary-3'boundary-ClusteringDirection for one cluster of reads

Columns 3-4-7-8-9 describe a single cluster (the boundaries and direction are in columns 7-8-9; the #reads and multiplicity of this cluster are in columns 3-4).
Columns 5-6-10-11-12 describe a single cluster.
Columns 20-21-14-15-16 describe a single cluster.
Columns 22-23-17-18-19 describe a single cluster.
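
The column grouping above can be expressed as a lookup table. The Python sketch below regroups one row, already split into fields, into per-cluster records. Values are kept as strings because this README does not specify how boundaries or directions are encoded; the Cluster type and the function name are hypothetical.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    reads: str         # number of reads in the cluster
    multiplicity: str  # number of mappings on the linear reference
    five_prime: str    # 5' boundary
    three_prime: str   # 3' boundary
    direction: str     # clustering direction


# 1-based column numbers: (reads, multiplicity, 5' boundary, 3' boundary, direction)
CLUSTER_COLUMNS = [
    (3, 4, 7, 8, 9),
    (5, 6, 10, 11, 12),
    (20, 21, 14, 15, 16),
    (22, 23, 17, 18, 19),
]


def clusters_from_row(fields: List[str]) -> List[Cluster]:
    """Group a row's fields into clusters; 13-column rows yield only the first two."""
    clusters = []
    for cols in CLUSTER_COLUMNS:
        if max(cols) > len(fields):
            break  # the row does not carry the remaining clusters
        clusters.append(Cluster(*(fields[c - 1] for c in cols)))
    return clusters


# Placeholder row whose values name their own column, to show the regrouping.
row = [f"c{i}" for i in range(1, 24)]
for cluster in clusters_from_row(row):
    print(cluster)
```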

Clusters that cannot be assigned to a specific event are appended at the end, under the "#		UNASSIGNED CLUSTERS" section.


*** Event Boundaries ***
1) Deletion: Deletion boundaries are roughly defined by [column8, column10] (Direction of clusters : --> <--)

2) Inversion: Inversion boundaries are roughly defined by [column8/column10 , column15/column16] 

3) Transposition: Segment that is being transposed is roughly defined by [column3, column7] (<--- --->) and it is transposed to the target location, roughly around column15/column16 (---> <---). The midpoint of column15 and column16 is probably a reasonable guess.

4) Tandem duplication: Segment that is being tandemly duplicated is roughly defined by [column7, column11] (Direction: <--- --->)
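
As a small illustration of the deletion and tandem duplication boundaries listed above, the following Python sketch pulls the rough interval out of a row that has already been split into fields. It assumes the boundary columns hold integer reference coordinates, which this README does not state explicitly; the function names and the example values are made up.

```python
from typing import List, Tuple


def column(fields: List[str], number: int) -> int:
    """Return 1-based column `number` as an integer coordinate (assumed format)."""
    return int(fields[number - 1])


def deletion_interval(fields: List[str]) -> Tuple[int, int]:
    """Rough deletion boundaries: [column8, column10]."""
    return column(fields, 8), column(fields, 10)


def tandem_duplication_interval(fields: List[str]) -> Tuple[int, int]:
    """Rough tandem-duplicated segment: [column7, column11]."""
    return column(fields, 7), column(fields, 11)


# Made-up 13-column row; only the columns used here carry values.
row = ["."] * 13
row[6], row[7], row[9], row[10] = "1000", "1200", "5200", "5400"
print(deletion_interval(row))
print(tandem_duplication_interval(row))
```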