GitHub - pratas/smash: A tool to find and visualize rearrangements in DNA sequences

SMASH is a completely alignment-free method to find and visualise rearrangements between pairs of DNA sequences. The detection is based on relative compression, namely using a FCM, also known as Markov model, of high context order (typically 20). The method has been approached with a tool (also called SMASH). For visualization, SMASH outputs a SVG image, with an ideogram output architecture, where the patterns are represented with several HSV values (only value varies). The following image, illustrating the information maps between human and chimpanzee for the several chromosomes, depicts an example:

1. INSTALLATION

We provide a binary for each 64 bits operating systems (Linux, OSX, Windows). However, for other purposes, such as source code compilation, use the following installation instructions. Cmake is needed for installation (http://www.cmake.org/). You can download it directly from http://www.cmake.org/cmake/resources/software.html or use an appropriate packet manager. In the following instructions we show the procedure to install, compile and create the information map between human and orangutan chromosome 20:

1.1 STEP 1

Download, install smash and resolve conflicts.

1.1.1 Linux

sudo apt-get install cmake
git clone https://github.com/pratas/smash.git
cd smash/src/
cmake .
make
cp SMASH ../../
cd ../../

Alternatively, you can install (without cmake and only for linux) using

sudo apt-get install cmake
git clone https://github.com/pratas/smash.git
cd smash/src/
cp Makefile.nix Makefile
make
cp SMASH ../../
cd ../../

1.1.2 OS X

Install brew:

ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"

only if you do not have it. After type:

brew install cmake
brew install wget
brew install gcc48
wget https://github.com/pratas/smash/archive/master.zip
unzip master.zip
cd smash-master/src/
cmake .
make

With some versions you might need to create a link to cc or gcc (after the brew install gcc48 command), namely

sudo mv /usr/bin/gcc /usr/bin/gcc-old   # gcc backup
sudo mv /usr/bin/cc /usr/bin/cc-old     # cc backup
sudo ln -s /usr/bin/gcc-4.8 /usr/bin/gcc
sudo ln -s /usr/bin/gcc-4.8 /usr/bin/cc

In some versions, the gcc48 is installed over /usr/local/bin, therefore you might need to substitute the last two commands by the following two:

sudo ln -s /usr/local/bin/gcc-4.8 /usr/bin/gcc
sudo ln -s /usr/local/bin/gcc-4.8 /usr/bin/cc

1.1.3 Windows

In windows use cygwin (https://www.cygwin.com/) and make sure that it is included in the installation: cmake, make, zcat, unzip, wget, tr, grep (and any dependencies). If you install the complete cygwin packet then all these will be installed. After, all steps will be the same as in Linux.

1.2 Step 2

Download the sequences [the links might change over time].

1.2.1 Linux and OS X

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38_chr20.fa.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Pongo_abelii/Assembled_chromosomes/seq/pab_ref_P_pygmaeus_2.0.2_chr20.fa.gz

1.3 Step 3

Unzip, exclude headers and filter content.

1.3.1 Linux

zcat hs_ref_GRCh38_chr20.fa.gz | grep -v ">" | tr -d -c "ACGTN" > HS20
zcat pab_ref_P_pygmaeus_2.0.2_chr20.fa.gz | grep -v ">" | tr -d -c "ACGTN" > PA20

1.3.2 OS X

gzcat hs_ref_GRCh38_chr20.fa.gz | grep -v ">" | tr -d -c "ACGTN" > HS20
gzcat pab_ref_P_pygmaeus_2.0.2_chr20.fa.gz | grep -v ">" | tr -d -c "ACGTN" > PA20

1.4 Step 4

Run SMASH.

1.4.1 Linux and OS X

./SMASH -v -c 20 -t 1.5 HS20 PA20

This step outputs a SVG image using the respective map under the name: HS20PA20.svg (for custom name use option: -o ANYNAME.svg). The respective information map is the following (transformed in a png image and rotated 90 degrees to left):

2 USAGE

The SMASH program have many options in the interface because there are a wide variety of parameters that can be defined by the user. However, for the detection of the arrangements only two are critical, namely context and threshold. Mathematical information about these parameters can be found in the paper.

To see the possible options type

./SMASH

or

./SMASH -h

These will print the following options:

Usage: SMASH <OPTIONS>... [FILE] [FILE]      

 -h                  give this help,                 
 -V                  display version number,
 -v                  verbose mode,            
 -f                  force (be sure!),        
                                             
 -c  <context>       context order (DEF: 20),
 -t  <threshold>     threshold [0.0,2.0] (DEF: 1.5),
 -m  <mSize>         minimum block size (DEF: 1000000),
 
 -i                  do not show inversions,
 -n                  do not show regulars,
 -r  <ratio>         image size ratio (MaxSeq/150),
 -a  <alpha>         alpha estimator (DEF: 1000),
 -s  <seed>          seed for random 'N',     
 -w  <wSize>         window size,
 -wt <wType>         window type [0|1|2|3] (DEF: 0),
 -d  <dSize>         sub-sample (DEF: 10000),
 -nd                 do not delete temporary files,
 -wi <width>         sequence width (DEF: 25),
                                             
 -p  <posFile>       output positions file,
 -o  <outFile>       output svg plot file,    
                                             
 [refFile]           reference file,          
 [tarFile]           target file.

By default, SMASH has many parameters assigned in order to avoid the estimation, enabling only to set both reference and target files. However, these defaults are only estimated to detect rearrangements in primates. Therefore, for other purposes you might need to adjust context and threshold parameters. Moreover, for custom image maps you might also need to set other parameters, such as width and/or ratio (scale). Only [refFile] and [TarFile] are mandatory.

2.1 Options meaning

Parameters	Meaning
-h	It will print the parameters menu (help menu)
-V	It will print the smash version number, license type and authors.
-v	It will print progress information such as positions of the patterns, times, etc.
-f	It will force to write over files already created.
-c <context>	Size of the FCM (Markov) context order (interval [1;28]). Contexts above 14 will be handled with a hash-table, where the implementation is approximately linear in memory relatively to the size of the sequence. When the sequence is very fragmented, or the species are somehow distant, or the sequencing/assembly process has low quality this value show not be very high.
-t <threshold>	It will be used to segment the high from the low regions of information content (interval [0;2]). For distant species this value might be slightly below 2 (such as 1.9).
-m <mSize>	Minimum size of the block considered as a valid patters after each segmentation process. Values below 1 Million for primate chromosomes might emerge excessive valid patterns. However for other purposes, such as gene scale analysis, this value should be set almost to 1.
-i	It will not show the information map regarding to inverted pattern regions.
-n	It will not show the information map regarding to regular pattern regions (normal regions).
-r <ratio>	Sets the ratio size of the image. Currently is fixed to 1000000 which is an estimated value to the medium of the primate chromosomes sizes relatively to the medium of the screen resolution. This parameter is not automatically adaptad since a fixed value will bring different size chromosomes to the same scale. Nevertheless, to use it in small sequences, namely bacterial genomes, this parameter might be adjusted.
-a <alpha>	Probabilities estimator. This value relates a linear interpolation between the maximum likelihood estimator and the uniform distribution. This also shows that when the total number of events is large, the estimator behaves as a maximum likelihood estimator. Default value is set to 1000.
-s <seed>	This is a parameter to the pseudo-random generation function. Different seed values ensure different generated values.
-w <wSize>	The window size among with the sub-sampling is calculated automatically, nevertheless this value might be adjusted for special needs.
-wt <wType>	Window filtering type. Types: 0, 1, 2 or 3. Type 0 stands for Hamming, 1 for Hann, 2 for Blackman, while 3 represents a rectangular window.
-d <dSize>	Sub-sampling value. This value among with the window size is calculated automatically. Nevertheless, for special purposes this value might be adjusted.
-wi <width>	Thickness of the image for each sequence. Default value is set to 25.
-p <posFile>	The positions from all the rearrangements detected on the run.
-o <outFile>	The output SVG image filename. The default uses the concatenation of reference with the target filenames (adding the "svg" extension). Beware: if the files are not in the working directory this might have problems due to several types of characters (such as '/').
[refFile]	The reference filename.
[tarFile]	The target filename.

2.2 Positions file

A file will be created, reporting each rearrangement position, with the following characteristics: type, id, start position, end position, direction.

An example ca be seen below, where columns are separated by a tab (ascii code:9)

type	id	start	end	direction
TARGET	1	12890	9068115	0-regular
REFERENCE	1	5542700	14243450	0-regular
TARGET	2	9100340	13134910	0-regular
REFERENCE	2	14243450	16350965	0-regular
REFERENCE	2	16383190	18129785	0-regular
TARGET	3	13154245	18883850	0-regular
REFERENCE	3	18136230	19083645	0-regular
REFERENCE	3	19160985	23666040	0-regular
TARGET	4	18961190	20920470	1-inverted
REFERENCE	4	23840055	23975400	0-regular
REFERENCE	4	24001180	24639235	0-regular
REFERENCE	4	24697240	25754220	0-regular

The first column reports if the regions are in the target or reference sequences and the second column sets an id for each similar region. The third and fourth columns, repectively, indicate the beginning and the end of each similar region, while the last column reports the direction (if was inverted or regular).

3. CITATION

On using this software/method please cite:

Diogo Pratas, Raquel M. Silva, Armando J. Pinho, Paulo J. S. G. Ferreira. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci. Rep. 5, 10203; doi: 10.1038/srep10203 (2015).

4. ISSUES

For any issue let us know at issues link.

5. LICENSE

GPL v3.

For more information see LICENSE file or visit

http://www.gnu.org/licenses/gpl-3.0.html

Name		Name	Last commit message	Last commit date
Latest commit History 217 Commits
bin		bin
imgs		imgs
scripts		scripts
src		src
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md

License

pratas/smash

Folders and files

Latest commit

History

Repository files navigation