Copy Number Variations (CNV) Simulator
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
cnvsim
.gitignore
Dockerfile
README.md
cnv-sim.py
setup-linux.sh
setup-mac.sh

README.md

Copy Number Variation Simulator (CNV-Sim)

In genomics, Copy Number Variations (CNVs) are a type of structural variation in a genome where sections of the genome are duplicated or deleted. The number of variations (duplications/deletions) varies between individuals in the human population.

The Copy Number Variation Simulator (CNV-Sim) is a simulation tool that extends the functionality of existing next-generation sequencing read simulators to introduce copy number variations in the generated reads. The resulting reads encompass amplifications as well as deletions according to a predefined list of variant regions.

CNV-Sim aids testing and benchmarking tools for copy number variation detection and analysis. The tool offers two types of simulation:

  • CNV simulation in whole genome. CNV-Sim utilizes the functionality of ART to introduce variations in the genome.
  • CNV simulation in targeted exome. CNV-Sim utilizes the functionality of Wessim to introduce variations in the targets.

Use CNV-Sim

Download as a standalone application

Run the setup script appropriate for your operating system to install required dependencies.

Alternatively, install the following dependencies on your machine (the setup script does all that for you):

Refer to the below CNV-Sim options section for more details on how to use all available command-line options

Download as a Docker image

We prefer that you run CNV-Sim from the Docker container as it has all the dependencies installed inside already; no need to go through any of the setup scripts included here.

New to Docker? Read this blog post to understand how it works (you don't need to be a Docker expert to get CNV-Sim Docker container running!).

Install Docker: Linux, Mac OS and Windows

How to use CNV-Sim Docker image:

docker run -v <absolute_local_path_to_input_directory>:/my_data nabavilab/cnv-sim ./cnv-sim.py -o /my_data/<simulation_name> [OPTIONS] {genome, exome} /my_data/<genome_file> [/my_data/<target_file>]

where:

  • <absolute_local_path_to_input_directory> is the directory where the required input files exist (genome.fa and target.bed if using exome simulation); MUST be absolute path
  • <simulation_name> is used as the name of the output folder in your data directory, where the output of CNV Sim will be saved
  • <genome_file> is the FASTA file name for the genome reference
  • <target_file> is the BED file name for the targets (only if using exome as the simulation type)

Refer to the below CNV-Sim options section for more details on how to use all available command-line options or simply type: docker run nabavilab/cnv-sim ./cnv-sim -h to read the help.

Use from Galaxy

CNV-Sim is available as a Galaxy tool.

Download it from the Galaxy Tool Shed: https://toolshed.g2.bx.psu.edu/view/ahosny/cnvsim/

Required Input

IMPORTANT: for the current version of CNV-Sim, the underlying pysam library used by Wessim might not be able to handle large reference files (such as the whole human genome) when introducing amplifications. To avoid errors, simulate CNVs in one chromosome at a time (reference and target for one chromosome in a single run). We are working on handling large files seamlessly and this fix will be released in the next version.

whole genome

  • Reference genome in FASTA format.

targeted exome

  • Reference genome in FASTA format.
  • Target exons file in BED format. The exome consists of introns and exons. The target file here should indicate the start and end positions of exons (not exomes).

CNV-Sim options

Usage: cnv-sim.py [-h] [-v] [-o OUTPUT_DIR_NAME] [-n N_READS] [-l READ_LENGTH]
                  [--cnv_list CNV_LIST] [-g REGIONS_COUNT]
                  [-r_min REGION_MINIMUM_LENGTH]
                  [-r_max REGION_MAXIMUM_LENGTH] [-a AMPLIFICATIONS]
                  [-d DELETIONS] [-cn_min COPY_NUMBER_MINIMUM]
                  [-cn_max COPY_NUMBER_MAXIMUM]
                  {genome,exome} genome [target]

Generates NGS short reads that encompass copy number variations in whole
genome and targeted exome sequencing

Positional arguments:
  {genome,exome}        simulate copy number variations in whole genome or
                        exome regions
  genome                path to the referece genome file in FASTA format
  target                path to the target regions file in BED format (if
                        using exome) (default: None)

Optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Show program's version number and exit.
  -o OUTPUT_DIR_NAME, --output_dir_name OUTPUT_DIR_NAME
                        a name to be used to create the output directory
                        (overrides existing directory with the same name).
                        (default: simulation_output)
  -n N_READS, --n_reads N_READS
                        total number of reads without variations (default:
                        10000)
  -l READ_LENGTH, --read_length READ_LENGTH
                        read length (bp) (default: 100)
  --cnv_list CNV_LIST   path to a CNV list file in BED format chr | start |
                        end | variation. If not passed, it is randomly
                        generated using CNV list parameters below (default:
                        None)

CNV list parameters:
  parameters to be used if CNV list is not passed

  -g REGIONS_COUNT, --regions_count REGIONS_COUNT
                        number of CNV regions to be generated randomly
                        (default: 20)
  -r_min REGION_MINIMUM_LENGTH, --region_minimum_length REGION_MINIMUM_LENGTH
                        minimum length of each CNV region (default: 1000)
  -r_max REGION_MAXIMUM_LENGTH, --region_maximum_length REGION_MAXIMUM_LENGTH
                        maximum length of each CNV region (default: 100000)
  -a AMPLIFICATIONS, --amplifications AMPLIFICATIONS
                        percentage of amplifications in range [0.0: 1.0].
                        (default: 0.5)
  -d DELETIONS, --deletions DELETIONS
                        percentage of deletions in range [0.0: 1.0]. (default:
                        0.5)
  -cn_min COPY_NUMBER_MINIMUM, --copy_number_minimum COPY_NUMBER_MINIMUM
                        minimum level of variations (copy number) introduced
                        (default: 3)
  -cn_max COPY_NUMBER_MAXIMUM, --copy_number_maximum COPY_NUMBER_MAXIMUM
                        maximum level of variation (copy number) introduced
                        (default: 10)

Contributor

Abdelrahman Hosny (@abdelrahmanhosny)

License

The MIT License (MIT)

Copyright (c) 2016 NabaviLab