Detection of structural variants in cancer mate-pair and paired-end data


SV-Bay is a tool for structural variant detection in cancer genomes using a Bayesian approach with correction for GC-content and read mappability. The algorithm description can be found in the article.


SV-Bay is implemented in Python 2 and was tested on both Linux and Mac OS X. Although it works with both Python 2.6 and 2.7, we strongly recommend using 2.7, as it shows a significant performance improvement due to the difference in garbage collector (GC) implementations.

A number of Python libraries are required to run SV-Bay. Installation from scratch on Ubuntu 14.04 is shown below:

sudo apt-get update
sudo apt-get install build-essential python-dev zlib1g-dev unzip
sudo apt-get install python-numpy python-scipy python-matplotlib
sudo python
sudo pip install pyaml pysam joblib

After that you can clone the SV-Bay repository:

sudo apt-get install git
git clone

Then edit the sample config.yaml file and proceed to input data preparation, as explained below.


SV-Bay uses a config file in YAML format. This file is shared by all processing steps. The following config options are not related to input data (options related to input data are described in the next section):

working_dir : "/.../sv-bay-data/" Common directory for all processing. All other files and folders will be created inside this one.

clustering_parallel_processes : 1 Number of parallel processes for clustering. SV-Bay works fast even with one process (less than 2 hours for mate-pair data with coverage 12). If the number of processes is more than 1, the clustering log will be unordered and very hard to read, so change it only if speed is crucial for you.

chromosomes : ["chr14","chr15", "chr17"] Chromosomes to process.

exp_num_sv: 100 Expected number of structural variants.

alpha : 0.01 Distribution cutoff used when deciding whether a read is normal or abnormal.

read_length : 50 Read length in input data.

ploidy : 4.0 Ploidy of input data.

numb_allel : 8 Minimum number of alleles to mark an SV as a co-amplification.

links_probabilities_file : "links_probabilities.txt" File to output resulting clusters and probabilities.

valid_links_dir : "valid_links/" Directory to output resulting valid clusters.

There are also several internal options: debug, clustering_log_file, probabilites_log_file, normal_fragments_dir, length_histogram_file, clusters_files_dir, lambda_file, serialized_stats_file. They are described in the sample config.yaml file; it is generally unnecessary to change their default values.
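As a rough illustration of the options listed above, the following sketch checks a config after it has been loaded into a Python dict (for example with a YAML parser). The helper name `check_config`, the choice of which options to treat as required, the expected types, and the rule that alpha lies between 0 and 1 are all assumptions made for this example; SV-Bay itself may validate its config differently or not at all.

```python
# Hypothetical helper, not part of SV-Bay: sanity-check the documented options
# once the YAML config has been loaded into a dict.
EXPECTED_TYPES = {
    "working_dir": str,
    "clustering_parallel_processes": int,
    "chromosomes": list,
    "exp_num_sv": int,
    "alpha": float,
    "read_length": int,
    "ploidy": (int, float),
    "numb_allel": int,
    "links_probabilities_file": str,
    "valid_links_dir": str,
}


def check_config(config):
    """Return a list of human-readable problems found in `config`."""
    problems = []
    for name in sorted(EXPECTED_TYPES):
        if name not in config:
            problems.append("missing option: %s" % name)
        elif not isinstance(config[name], EXPECTED_TYPES[name]):
            problems.append("unexpected type for option: %s" % name)
    # Assumption: alpha is a tail probability, so it should be in (0, 1).
    alpha = config.get("alpha")
    if isinstance(alpha, float) and not 0.0 < alpha < 1.0:
        problems.append("alpha should lie strictly between 0 and 1")
    return problems


# Values copied from the option descriptions above.
example = {
    "working_dir": "/.../sv-bay-data/",
    "clustering_parallel_processes": 1,
    "chromosomes": ["chr14", "chr15", "chr17"],
    "exp_num_sv": 100,
    "alpha": 0.01,
    "read_length": 50,
    "ploidy": 4.0,
    "numb_allel": 8,
    "links_probabilities_file": "links_probabilities.txt",
    "valid_links_dir": "valid_links/",
}
print(check_config(example))
```

An empty list means that every documented option is present with a plausible type.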

Input data

SV-Bay requires a number of input files to work. This can look a bit confusing, but most of these files are common for the human genome and can simply be downloaded. Config options related to input are described below:

sam_files_dir : "bam/" Input directory with per-chromosome bam or sam files. Bam files should be sorted and indexed, and the .bam.bai files should be in the same folder. The name of the file for each chromosome must contain "chrSomething", e.g. "chr7_sorted.bam" or "chrX.sam". If you have one bam for the whole genome, use utils/ script to split it:

python src/utils/ -i yourBigBAMfile.bam -o outputDir/
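The naming rule for bam and sam files can be illustrated with a short sketch that pulls the chromosome name out of a file name. The function name and the regular expression are assumptions made for this example; how SV-Bay itself parses file names is not described here.

```python
import os
import re

# Assumption: a chromosome name is "chr" followed by letters and/or digits,
# e.g. chr7, chr14, chrX. Underscores and dots end the name.
CHROM_PATTERN = re.compile(r"chr[0-9A-Za-z]+")


def chromosome_from_filename(path):
    """Return the chromosome name found in the file name, or None."""
    match = CHROM_PATTERN.search(os.path.basename(path))
    return match.group(0) if match else None


print(chromosome_from_filename("bam/chr7_sorted.bam"))  # chr7
print(chromosome_from_filename("chrX.sam"))             # chrX
print(chromosome_from_filename("sample.bam"))           # None
```

Note that this simple pattern would also match unrelated words that start with "chr" in a file name, so stricter matching may be needed for real data.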

fa_files_dir : "fa/" Input directory with per-chromosome .fa files. Fa file names should consist exactly of chromosome name and extension, e.g. chr14.fa. You can download fa files for hg19 and hg38: and

gem_files_dir : "gem/" Input directory with per-chromosome .gem mappability files. Gem file names should consist exactly of chromosome name and extension, e.g. chr14.gem. You can download pre-calculated gem for hg19 and hg38: and If you have one gem for the whole genome, use utils/ script to split it:

python src/utils/ -i yourBigGEMfile.gem  -o outputDir/

centromic_file : "centrom_hg38.txt" Input file with information about centromere positions in the human genome. Files for hg19 and hg38 are available in the data subfolder of the SV-Bay repository (data/centrom_hg19.txt and data/centrom_hg38.txt).

cnv_file: "simulated_reads_cnv.txt" File generated by Control-FREEC. For the test data it is available in the data subfolder of the SV-Bay repository (data/simulated_reads_cnv.txt).

Preparation of the example data to run SV-Bay is shown below:

mkdir sv-bay-data/ && cd sv-bay-data
mkdir bam && cd bam
wget && tar xzf bam_tumor.tar.gz && mv bam_tumor/* . && cd ..
mkdir fa_files && cd fa_files
wget && unzip && cd ..
mkdir gem_files && cd gem_files
wget && tar xzf gem_hg38.tar.gz && mv gem_hg38/* . && cd ..
cp ~/SV-Bay/data/centrom_hg38.txt .
cp ~/SV-Bay/data/simulated_reads_cnv.txt .

Now change working_dir in the sample config and you are ready to run SV-Bay.
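Before running the tool, it can help to confirm that the prepared folders contain a file for every chromosome. The sketch below is a hypothetical helper, not part of SV-Bay: it applies the naming rules from the input data section (fa and gem files named exactly after the chromosome, bam or sam files merely containing the chromosome name) to the folder names created by the commands above.

```python
import os
import re

# Assumption: a chromosome name is "chr" followed by letters and/or digits.
CHROM_PATTERN = re.compile(r"chr[0-9A-Za-z]+")


def find_missing_inputs(working_dir, chromosomes, sam_dir, fa_dir, gem_dir):
    """Return a list of per-chromosome input files that appear to be missing."""
    missing = []
    sam_path = os.path.join(working_dir, sam_dir)
    alignments = os.listdir(sam_path) if os.path.isdir(sam_path) else []
    # Collect the chromosome names that occur in bam/sam file names.
    aligned = set()
    for name in alignments:
        match = CHROM_PATTERN.search(name)
        if match and name.endswith((".bam", ".sam")):
            aligned.add(match.group(0))
    for chrom in chromosomes:
        if chrom not in aligned:
            missing.append(os.path.join(sam_path, "<file containing %s>" % chrom))
        # fa and gem files must be named exactly <chromosome>.<extension>.
        for folder, extension in ((fa_dir, ".fa"), (gem_dir, ".gem")):
            path = os.path.join(working_dir, folder, chrom + extension)
            if not os.path.isfile(path):
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in find_missing_inputs("sv-bay-data", ["chr14", "chr15", "chr17"],
                                    "bam", "fa_files", "gem_files"):
        print("missing: " + path)
```

The check does not verify that bam files are sorted and indexed, or that the centromere and Control-FREEC files are present.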


The SV-Bay workflow consists of 3 steps. The config file is common for all steps.

Normal/abnormal fragments separation and clustering

In this step SV-Bay calculates statistics of the fragment length distribution, separates normal and abnormal fragments, and clusters the abnormal fragments.

python -B src/ -c config/config.yaml
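To give an intuition for how a cutoff such as alpha can separate normal from abnormal fragments, the sketch below flags fragments whose length falls outside the empirical alpha and 1 - alpha quantiles of the observed lengths. This is only an illustration under that assumption: the function names are hypothetical, and the actual criterion used by SV-Bay (which may also take read orientation and other features into account) is described in the article, not here.

```python
def length_bounds(fragment_lengths, alpha):
    """Return (lower, upper) empirical quantile bounds for normal fragments.

    Assumption for this sketch: alpha is the probability mass cut from
    each tail of the fragment length distribution.
    """
    ordered = sorted(fragment_lengths)
    last = len(ordered) - 1
    lower_index = int(round(alpha * last))
    upper_index = int(round((1.0 - alpha) * last))
    return ordered[lower_index], ordered[upper_index]


def is_abnormal(length, bounds):
    """A fragment is flagged as abnormal if its length is outside the bounds."""
    lower, upper = bounds
    return length < lower or length > upper


# Toy data: fragment lengths 100, 101, ..., 200.
lengths = list(range(100, 201))
bounds = length_bounds(lengths, alpha=0.05)
print(bounds)
print(is_abnormal(150, bounds), is_abnormal(199, bounds))
```

With a smaller alpha the bounds widen, so fewer fragments are treated as abnormal and passed on to clustering.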

Applying probabilistic model to validate clusters

In this step SV-Bay calculates a probability for each cluster to determine whether it is noise or a real SV.

python -B src/ -c config/config.yaml

Complex SVs assembly

In this step SV-Bay assembles clusters into complex and simple SVs and outputs the final results.

python -B src/ -c config/config.yaml > results

The script can also exclude germline mutations, if the respective data is available. To do so, run for the germline dataset using a separate working_dir and then run with the flag -n and the name of the folder with germline clusters:

python -B src/ -c config/config_germ.yaml
python -B src/ -c config/config.yaml -n '/home/sv-bay/sv-bay-data-germ/cluster_files/' > results
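The idea behind germline exclusion can be sketched as follows: a tumor cluster is dropped when a germline cluster links the same pair of chromosomes at nearby positions. The tuple layout `(chrom1, pos1, chrom2, pos2)`, the function names, and the 500 bp tolerance are all assumptions made for this example; the format of SV-Bay's cluster files and the matching rule applied with -n are not described here.

```python
def clusters_match(cluster, other, tolerance):
    """True if both breakpoints lie on the same chromosomes within `tolerance` bp."""
    chrom1, pos1, chrom2, pos2 = cluster
    other_chrom1, other_pos1, other_chrom2, other_pos2 = other
    return (chrom1 == other_chrom1 and chrom2 == other_chrom2
            and abs(pos1 - other_pos1) <= tolerance
            and abs(pos2 - other_pos2) <= tolerance)


def remove_germline(tumor_clusters, germline_clusters, tolerance=500):
    """Keep only tumor clusters that have no matching germline cluster."""
    return [cluster for cluster in tumor_clusters
            if not any(clusters_match(cluster, germ, tolerance)
                       for germ in germline_clusters)]


# Toy clusters in the hypothetical (chrom1, pos1, chrom2, pos2) layout.
tumor = [("chr14", 1000, "chr15", 5000), ("chr17", 200, "chr17", 9000)]
germline = [("chr14", 1100, "chr15", 4900)]
print(remove_germline(tumor, germline))
```

In this toy example only the chr17 cluster remains, because the chr14 to chr15 cluster has a germline counterpart within the tolerance.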

Test data

Please download example tumor and control bam files for chromosomes 14, 15, 17 to test SV-Bay here. This is a tar.gz file of approximately 1.7 GB that contains all the data you need to run the tool for 3 chromosomes:

  • separated fasta files (hg38 version)
  • .gem files
  • bam files for tumor samples
  • bam files for control samples
  • Results of FREEC tool run (for tumor sample)
  • list of centromeres (hg38 version)

Config file for test data is located in