A complete command-line pipeline to simulate genomic variations
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
docs
.gitignore
Exome_Script.py
LICENSE
README.md
art_run.sh

README.md

Introduction

Varsimlab is a command line python pipeline designed to easily generate artificial short reads, with structural and copy number variations. Using ART short read simulation and SInC error generation, Varsimlab can quickly simulate biologically realistic tumor and normal short reads for either genome or exome simulations.

Varsimlab can generate SNPs, INDELS, and CNVs to user specification, and can simulate tumor heterogeneity and polyploidy.

For more information, see the Varsimlab documentation site here

Setup and Dependencies

Varsimlab can be called from the command line using any python 3 version

Varsimlab uses art_illumina to generate short reads with realistic sequencing errors. The documentation is available here

To install ART

 curl -O https://www.niehs.nih.gov/research/resources/assets/docs/artbinmountrainier2016.06.05linux64.tgz
tar -xvzf artbinmountrainier2016.06.05linux64.tgz

VarSimLab uses SInC simulator to generate biologically realistic tumor genomic variations. The source files and instructions on compiling are available here

To install SINC, first download source files, then run

gcc -o genProfile genProfile.c
gcc -o SInC_simulate SInC_simulate.c -lm -lgsl -lgslcblas
gcc -o SInC_readGen -O2 SInC_readGen.c -lgsl -lgslcblas -lpthread

If you'd like to use Varsimlabs exome sequencing capabilities, Varsimlab uses Bedtools is required. bedtools documentation is available here

To install bedtools (optional: only used in Exome sequence simulation)

wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxvf bedtools-2.25.0.tar.gz
cd bedtools2
make

Finally, if you'd like VarSimLab to automatically align the reads you've generated back to your query sequence, download BWA. BWA documentation is available here

installation instructions for BWA (optional: only used for alignment if the -sam flag is invoked), are available here

Quick Start Guide

genome sequence usage

To simulate tumor and normal reads for a genome sequence Varsimlab requires 3 arguments

  1. A name for the output file which will contain tumor and normal reads
  2. The reference sequence you are interested in simulating
  3. the flag -use_genome

References for the human genome can be found online here

an example use case

 python3 Exome_Script.py output_directory ~/chr20.fa -use_genome

exome sequence usage

To simulate tumor and normal reads for a exome sequence, Varsimlab requires the following

  1. A name for the output file which will contain tumor and normal reads
  2. The reference sequence you are interested in simulating
  3. -bed argument, followed by the path to a tab seperated bed file containing the positions of the exonic regions of your sequence

An example use case:

python3 Exome_script.py output_directory ~/chr20.fa -bed chr20.bed

Step 3 might sound daunting but fortunately UCSC tablebrowser makes it easy. The website can be found here and here is an explanatory video from USCS that may be helpful

Exome bed file explanatory video

output format

Exome_script.py will generate a directory with the name given. In this directory will be a the following:

  1. A log file called SIMULATION_IS_COMPLETE.txt, containing details about the run and the parameters provided.
  2. A directory labeled normal, with the normal reads in files ending in .fq
  3. A directoy labeled tumor, with tumor reads in files ending in .fq, and error benchmark files, with the positions and variant for each of the tumor variations generated

Available Arguments

to see this guide on the command line, type

python3 Exome_Script.py -h 
usage: Exome_Script.py [-h] (-use_genome | -bed BED) [-c C] [-s] [-l L] [-sam]
                       [-m M] [-cnv CNV] [-cnv_min_size CNV_MIN_SIZE]
                       [-cnv_max_size CNV_MAX_SIZE] [-snp SNP] [-indel INDEL]
                       [-ploidy {1,2,3}] [-subclones SUBCLONES]
                       filename genome

positional arguments:
  filename              name of output file
  genome                genome to be processessed

optional arguments:
  -h, --help            show this help message and exit
  -use_genome           generate tumor and normal for entire provided sequence
  -bed BED              generate tumor and normal based on bed file containing
                        exonic regions

read generation parameters:
  arguments to adjust read generation

  -c C                  read depth of coverage
  -s                    use single end reads (default paired)
  -l L                  read length. default 100 bp
  -sam                  align reads to reference genome, and output sam files.
  -m M                  maximum distance for two exonic ranges to be merged into
                        one range. If zero, merges only those ranges that
                        directly overlap with each other

error parameters:
  arguments to adjust tumor error generation

  -cnv CNV              percent of total input to be incorporated into a CNV.
                        Values from 0 to 100. 4 would signify 4 percent of
                        input should be included in CNVs
  -cnv_min_size CNV_MIN_SIZE
                        minimum size of CNVs
  -cnv_max_size CNV_MAX_SIZE
                        CNV_max_size
  -snp SNP              percent of total input to be turned into SNPs. Values
                        from 0 to 100. A value of 5 indicates 5 percent of
                        genome should be turned into SNPs
  -indel INDEL          percent of total input to be included in INDELS.
                        values from 0 to 100, a value of 1 indicates 1 percent
                        of the genome should be included in indels
  -ploidy {1,2,3}       tumor ploidy. default diploid
  -subclones SUBCLONES  generate multiple tumor subclones, to simulate tumor
                        heterogeneity