Simulation of Tumor Genomes -- Initiated at the 2017 NYGC-NCBI Hackathon
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


A python package that simulates the structural variations of cancer genomes. -- Initiated at the 2017 NYGC-NCBI Hackathon


There is no "ground truth" for detecting structural variations in cancer genomics. Complex rearrangements occur per individual cancer samples, with no distinction in the sequence of mutations (SNPs, copy number variations and indels to CNVs, transpositions, and duplication) that lead to any specific structural arrangement. Simulating the possible SNVs and CNVs in a unique variation of a reference genome followed by simulation tumor model of cancer-related variations will be beneficial in better understanding the pathways of cancer progression after a mutation event, as found in cell division, aneuploidy, and other relative phenomena.


Generate a simulated tumor genome, based on a user-provided genome file as reference.


Dependencies: found in requirements.txt, please be sure to have them installed on your system. The software package is written in Python and contained in the lib folder.

> pip install -r requirements.txt


The user can provide their reference genome in FASTA format as input file to, or use our default example fasta.

> cd lib
> python 
> or (optional)
> python [-usage] <path/to/input_file>

To view the help options, type -h:

> python -h
Usage: [-h] [--input_fasta INPUT_FASTA]
                            [--output_tumor_fasta OUTPUT_TUMOR_FASTA]
                            [--output_normal_fasta OUTPUT_NORMAL_FASTA]

Simulate cancer genomic structural variations

optional arguments:
  -h, --help            show this help message and exit
  --input_fasta INPUT_FASTA
                        file path for the input (default genome) fasta
  --output_tumor_fasta OUTPUT_TUMOR_FASTA
                        file path for the output tumor (cancer genome) fasta
  --output_normal_fasta OUTPUT_NORMAL_FASTA
                        file path for the output normal (SNV-added) fasta

How it works calls upon to generate the random mutations for the normal unique genome case. uses the lower level for the simpler mutations and logs the distinction of indels, translocations, duplications and inversions.


To run unit tests, run nosetests from the top-lelel directory of the project. If you encounter errors, such as test_main failing because it cannot find the data in the data folder, it's most likely because you are not running tests from the top-level directory. A subsampled version of hg38 is also provided in the data folder of this repository. To download the reference FASTA hg38 or hg19, use the following commands:

For reference fasta hg38.fa (approximately 3.0 GB):

> wget
> tar -xzf hg38.chromFa.tar.gz
> cd chroms
> cat chr*.fa > hg38.fa
> rm chr*.fa

For reference fasta hg19.fa (approximately 3.1 GB):

> wget
> tar -xzf chromFa.tar.gz
> cat chr*.fa > hg19.fa
> rm chr*.fa