# Cancer Genome Simulation Overview
This notebook provides a pipeline that runs insilicoSV sequentially to simulate cancer genomes.
It takes as input a YAML file containing:
  - The path to the reference genome.
  - The paths to several insilicoSV YAML config files.
  - A list of the tumor clones, each with the sequence of config files to run to obtain the clone genome.
  - The tumor purity of each clone, used when simulating the reads.
  - The total coverage to simulate.
Refer to the provided clones.yaml config for an example of the expected syntax.
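
As a rough illustration of the structure described above, the sketch below checks a parsed config for two easy mistakes. It is not part of the pipeline: `validate_clone_config` is a hypothetical helper, the dictionary mirrors what `yaml.safe_load` would return for a small config, and the rule that purities sum to 100 is an assumption based on the example config, where they do.

```python
# A parsed config, written as the dict that yaml.safe_load would return.
config = {
    'reference': './clones/chr21.fa',
    'config_files': {0: 'configs/clone_configs/svs_0.yaml',
                     1: 'configs/clone_configs/svs_1.yaml'},
    'clones': {'A': [0], 'B': [0, 1]},
    'purity': {'A': 60, 'B': 40},
    'coverage': 30,
}

def validate_clone_config(config):
    """Return a list of problems found in a parsed clones config (hypothetical helper)."""
    problems = []
    known = set(config['config_files'])
    for clone, lineage in config['clones'].items():
        # Every step of a clone's lineage must point at a declared config file.
        missing = [step for step in lineage if step not in known]
        if missing:
            problems.append(f"clone {clone}: unknown config_files keys {missing}")
        if clone not in config['purity']:
            problems.append(f"clone {clone}: no purity given")
    # Assumption: purities are percentages that should add up to 100.
    total = sum(config['purity'].values())
    if total != 100:
        problems.append(f"purities sum to {total}, expected 100")
    return problems

print(validate_clone_config(config))
```

An empty list means none of these checks failed; it does not guarantee that insilicoSV will accept the referenced config files.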

## Generate the Clone Genomes

In [67]:
import sys

import yaml
from pysam import VariantFile
from IPython.display import Image
from collections import defaultdict
import os
import subprocess
import shutil
from math import ceil

In [68]:
number_threads = '10'

In [69]:
%%sh
# -f keeps the first run from failing when ./clones/ does not exist yet
rm -rf ./clones/
mkdir -p clones

# download the chr21 reference
wget -O clones/chr21.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz
gunzip -f clones/chr21.fa.gz

--2025-08-11 21:33:55--  https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12709705 (12M) [application/x-gzip]
Saving to: ‘clones/chr21.fa.gz’

In [70]:
%%sh
# copy the YAML config file
cp ./configs/clones.yaml ./clones/.

# display the config
cat ./clones/clones.yaml

reference: ./clones/chr21.fa
config_files:
                0: 'configs/clone_configs/svs_0.yaml'
                1: 'configs/clone_configs/svs_1.yaml'
                2: 'configs/clone_configs/svs_2.yaml'
                3: 'configs/clone_configs/svs_3.yaml'
                4: 'configs/clone_configs/svs_4.yaml'


clones:
    A:
     - 0
    B:
      - 0
      - 1
    C:
      - 0
      - 1
      - 2
    D:
      - 0
      - 1
      - 3
    E: [0, 4]

purity:
  A: 30
  B: 20
  C: 20
  D: 20
  E: 10

coverage: 30

In [93]:
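def run_bash_process(command):
    # Run the space-joined command through the shell.
    # check=True makes a non-zero exit code raise CalledProcessError,
    # so the except branch below is actually reachable.
    try:
        output = subprocess.run(
            ' '.join(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            shell=True,
            text=True,
            check=True
        )

        if output.stderr:
            print("STDERR:", output.stderr)

    except subprocess.CalledProcessError as e:
        print("Command failed with return code:", e.returncode)
        print(e.stderr)
        raise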

In [72]:
def call_insilicosv(path, config_name):
    command = ['insilicosv', '-c', path + config_name]
    run_bash_process(command)
    
    merge_command = ['cat', path + 'sim.hapA.fa',  path + 'sim.hapB.fa', '>', path + 'sim.fa']
    run_bash_process(merge_command)


In [73]:
def clonal_genome_generator(folder_path, config_name):
    reads_to_simulate = {}
    vcf_list = []
    with open(folder_path + config_name) as config_yaml:
        config = yaml.safe_load(config_yaml)

    for clone_name, dependencies in config['clones'].items():
        current_path = folder_path 
        previous_path = ''
        lineage_depth = 0
        for dependency in dependencies:
            lineage_depth += 1
            current_path = current_path + '/dependency_' + str(dependency) + '/'
            config_name = config['config_files'][dependency].split('/')[-1]

            if os.path.exists(current_path): 
                previous_path = current_path + '/sim.fa' 
                continue
            
            os.makedirs(current_path)
            shutil.copy(config['config_files'][dependency], current_path)
            
            # Append the previous insilicoSV output as new reference or use the initial reference if first call
            reference = config['reference']
            if previous_path:
                reference = previous_path
            with open(current_path + config_name, 'a') as file:
                file.write('\nreference: ' + reference + "\n")
            call_insilicosv(current_path, config_name)
            previous_path = current_path + '/sim.fa'
            vcf_list.append(current_path + 'sim.vcf')
        purity = config['purity'][clone_name]
        reads_to_simulate[clone_name] = [previous_path.split('sim.fa')[0], purity, lineage_depth]
    global_coverage = config['coverage']
    reference = config['reference']
    return reference, global_coverage, vcf_list, reads_to_simulate

In [74]:
root_path = './clones/'
reference, coverage, vcf_list, reads_to_simulate = clonal_genome_generator(root_path, 'clones.yaml')

STDERR: 2025-08-11 21:34:00,798 INFO insilicoSV version v0.1.3
2025-08-11 21:34:01,096 INFO Constructing SVs from 5 categories
2025-08-11 21:34:01,099 INFO Constructed 25 SVs
2025-08-11 21:34:01,099 INFO Deciding placement order for 25 SVs
2025-08-11 21:34:01,099 INFO Placing 25 svs
2025-08-11 21:34:01,106 INFO Placed 25 svs in 0.0s.
2025-08-11 21:34:01,106 INFO Writing outputs
2025-08-11 21:34:01,106 INFO Writing new haplotypes
2025-08-11 21:34:01,288 INFO Writing VCF file
2025-08-11 21:34:01,313 INFO Writing novel insertions file
2025-08-11 21:34:01,313 INFO Writing novel adjacencies file
2025-08-11 21:34:01,313 INFO Output path: ./clones//dependency_0
2025-08-11 21:34:01,313 INFO insilicoSV finished in 0.5s

STDERR: 2025-08-11 21:34:01,906 INFO insilicoSV version v0.1.3
2025-08-11 21:34:02,453 INFO Constructing SVs from 1 categories
2025-08-11 21:34:02,456 INFO Constructed 50 SVs
2025-08-11 21:34:02,457 INFO Deciding placement order for 50 SVs
2025-08-11 21:34:02,457 INFO Placing 50

## Read Simulation
### Short-Read Simulation
Below we use ```DWGSIM``` to simulate paired-end short reads for each clone, at a coverage derived from the clone's purity and the requested total coverage.

After generating the reads, we align them with minimap2 (short-read mode) and sort the alignments using ```samtools```.
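
The per-clone coverage can be worked through in isolation. The sketch below mirrors the formula used in this notebook's simulation loops (total coverage times purity, divided by twice the lineage depth); `clone_coverage` is a hypothetical helper name, and the rounding up matches the short-read loop only.

```python
from math import ceil

def clone_coverage(total_coverage, purity_percent, lineage_depth, round_up=True):
    """Coverage passed to the read simulator for one clone (hypothetical helper)."""
    share = total_coverage * purity_percent / 100 / (2 * lineage_depth)
    # The short-read loop rounds up; the long-read loop uses the raw value.
    return ceil(share) if round_up else share

# Purity and lineage depth per clone, taken from the example clones.yaml.
clones = {'A': (30, 1), 'B': (20, 2), 'C': (20, 3), 'D': (20, 3), 'E': (10, 2)}
for name, (purity, depth) in clones.items():
    print(name, clone_coverage(30, purity, depth))
```

For clone A this gives 30 * 30 / 100 / 2 = 4.5, rounded up to 5, which matches the value printed by the short-read cell.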

In [75]:
def call_dwgsim(dwgsim_coverage, read_length, platform, genome, output_prefix):
    dwgsim_path = shutil.which('dwgsim')

    command = [dwgsim_path, '-C', str(dwgsim_coverage), '-1', str(read_length), '-2', str(read_length), '-o', platform, '-H', genome, output_prefix]
    run_bash_process(command)

In [97]:
def align_reads(platform, reference, reads, output_name):
    command_align = ['minimap2', '-t', number_threads, '-ax', platform, reference] + reads + ['|', 'samtools', 'sort', '-@', number_threads, '-o', output_name, '-']
    run_bash_process(command_align)
    
    command_index = ['samtools', 'index', '-@', number_threads, output_name]
    run_bash_process(command_index)
    
    command_get_coverage = ['samtools', 'coverage', output_name, '>', output_name + '.coverage']
    run_bash_process(command_get_coverage)
    
    with open(output_name + '.coverage', 'r') as cov_file:
        print('PROCESSED: Clone', output_name, 'coverage\n', cov_file.read())

In [77]:
def merge_clones(output_name, list_bams):
    print('Merging clones...')
    command = ['samtools', 'merge', '-@', number_threads, '-o', output_name] + list_bams
    run_bash_process(command)
    
    print('Indexing reads...')
    command_index = ['samtools', 'index', '-@', number_threads, output_name]
    run_bash_process(command_index)
    
    print('Computing coverage...')
    command_get_coverage = ['samtools', 'coverage', output_name, '>', output_name + '.coverage']
    run_bash_process(command_get_coverage)
    
    with open(output_name + '.coverage', 'r') as cov_file:
        print('PROCESSED: Whole cancer genome', output_name, 'coverage', cov_file.read())

We then merge all the simulated reads to obtain the requested coverage.
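
One way to check the merged BAM against the requested coverage is to parse the `samtools coverage` table that `merge_clones` writes to `<output>.coverage`. The sketch below is only an illustration: `mean_depth_by_contig` is a hypothetical helper, and the example text reproduces the tab-separated short-read table shown in this notebook's output.

```python
import csv
import io

def mean_depth_by_contig(coverage_text):
    """Map each contig name to its mean depth from a samtools coverage table."""
    reader = csv.DictReader(io.StringIO(coverage_text), delimiter='\t')
    return {row['#rname']: float(row['meandepth']) for row in reader}

# Table copied from the merged short-read output in this notebook.
example = (
    "#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq\n"
    "chr21\t1\t46709983\t11682857\t40069331\t85.7832\t37.7478\t17.4\t54.4\n"
)
depths = mean_depth_by_contig(example)
print(depths)
```

In practice the text would come from reading the `.coverage` file rather than from a literal string.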

In [94]:
read_length = 151
dwgsim_platform = '0' # Illumina
alignment_platform = 'sr'
list_sr_bams = []
for clone_name, (genome_folder, purity, lineage_depth) in reads_to_simulate.items():
    print('Simulating reads for clone', clone_name)
    output_prefix = genome_folder + 'sim_sr.dwgsim'
    clone_coverage = ceil(coverage * purity / 100 / (2*lineage_depth))
    print('Clone purity', purity, ' Clone coverage', clone_coverage)
    call_dwgsim(clone_coverage, read_length, dwgsim_platform, genome_folder + '/sim.fa', output_prefix)
    
    print('Aligning reads')
    r1 = output_prefix + '.bwa.read1.fastq.gz'
    r2 = output_prefix + '.bwa.read2.fastq.gz'
    output_name = genome_folder + '/' + clone_name + '_sim_sr.dwgsim.bam'
    list_sr_bams.append(output_name)
    reads = [r1, r2]
    align_reads(alignment_platform, reference, reads, output_name)

Simulating reads for clone A
Clone purity 30  Clone coverage 5
STDERR: [dwgsim_core] chr21_hapA length: 46824501
[dwgsim_core] chr21_hapB length: 46818423
[dwgsim_core] 2 sequences, total length: 93642924

In [79]:
output_name_sr = root_path + 'cancer_genome_sr.bam'
merge_clones(output_name_sr, list_sr_bams)

Merging clones...
Indexing reads...
Computing coverage...
PROCESSED: Whole cancer genome ./clones/cancer_genome_sr.bam coverage #rname	startpos	endpos	numreads	covbases	coverage	meandepth	meanbaseq	meanmapq
chr21	1	46709983	11682857	40069331	85.7832	37.7478	17.4	54.4



### Long-Read Simulation
We use PBSIM3 to simulate HiFi reads from the synthetic genome. Since PBSIM3 outputs reads for each reference contig, we also combine the reads from the two synthetic haplotypes into a single FASTQ file.

After generating the reads, we align them with minimap2 (HiFi mode) and sort the alignments using ```samtools```.
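
The step that combines the per-contig reads can also be done without shell tools. The sketch below is a possible alternative to the `zcat` call used in this notebook, not the notebook's own method: `merge_gzipped_fastqs` is a hypothetical helper that decompresses every file matching a glob pattern into one plain FASTQ, overwriting the output instead of appending to it.

```python
import glob
import gzip
import shutil

def merge_gzipped_fastqs(pattern, output_path):
    """Concatenate all gzipped FASTQ files matching `pattern` into `output_path`."""
    paths = sorted(glob.glob(pattern))
    # 'wb' truncates the output, so re-running does not duplicate reads.
    with open(output_path, 'wb') as merged:
        for path in paths:
            with gzip.open(path, 'rb') as part:
                shutil.copyfileobj(part, merged)
    return paths
```

A call such as `merge_gzipped_fastqs(output_prefix + '*.fq.gz', output_prefix + '.fastq')` would mirror the pattern used in `call_pbsim`.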

In [80]:
def call_pbsim(pbsim_coverage, read_length_mean, accuracy_mean, genome, output_prefix):
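    # Requires an active conda environment: CONDA_PREFIX is where this notebook expects the QSHMM model.
    conda_prefix = os.environ['CONDA_PREFIX']
    # Every element is converted to str because run_bash_process joins the list with spaces.
    command = ['pbsim', '--depth', str(pbsim_coverage), '--genome', genome, '--prefix', output_prefix, '--strategy', 'wgs', '--method', 'qshmm', '--qshmm', conda_prefix + '/data/QSHMM-RSII.model', '--length-mean', str(read_length_mean), '--accuracy-mean', str(accuracy_mean)]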
    run_bash_process(command)
    print('Merging haplotypes')
    merge_command = ['zcat', output_prefix + '*.fq.gz', '>>', output_prefix + '.fastq']
    run_bash_process(merge_command)

In [85]:
read_length_mean = '20000'
accuracy_mean = '0.999'
alignment_platform = 'map-hifi'
print('Simulating long reads at coverage', coverage)
list_lr_bams = []
for clone_name, (genome_folder, purity, lineage_depth) in reads_to_simulate.items():
    print('Simulating reads for clone', clone_name)
    clone_coverage = coverage * purity / 100 / (2*lineage_depth)
    print('Clone purity', purity, 'Clone coverage', clone_coverage)
    output_prefix = genome_folder + 'sim_lr.pbsim'
    call_pbsim(clone_coverage, read_length_mean, accuracy_mean, genome_folder + '/sim.fa', output_prefix)
    
    print('Aligning reads')
    reads = [output_prefix + '.fastq']
    output_name = genome_folder + '/' + clone_name + '_sim_lr.pbsim.bam'
    list_lr_bams.append(output_name)
    align_reads(alignment_platform, reference, reads, output_name)

Simulating long reads at coverage 30
Simulating reads for clone A
Clone purity 30 Clone coverage 4.5 30 30 1
STDERR: :::: Simulation parameters :::

strategy : wgs
method : qshmm
qshmm : /Users/ebattist/anaconda3/envs/insilicosv-demo-env/data/QSHMM-RSII.model
genome : ./clones//dependency_0///sim.fa
prefix : ./clones//dependency_0//sim_lr.pbsim
id-prefix : S
depth : 4.500000
length-mean : 20000.000000
length-sd : 7000.000000
length-min : 100
length-max : 1000000
difference-ratio : 6:55:39
seed : 1755008532
accuracy-mean : 0.990000
pass_num : 1
hp-del-bias : 1.000000

:::: Reference stats ::::

file name : ./clones//dependency_0///sim.fa

ref.1 (len:46824501) : chr21_hapA
ref.2 (len:46818423) : chr21_hapB

:::: Simulation stats (ref.1) ::::

read num. : 10473
depth : 4.500005
read length mean (SD) : 20119.401413 (6927.139265)
read length min : 3693
read length max : 54348
read accuracy mean (SD) : 0.960128 (0.042890)
substitution rate. : 0.002401
insertion rate. : 0.022033
deletion rate

In [89]:
output_name_lr = root_path + 'cancer_genome_lr.bam'
merge_clones(output_name_lr, list_lr_bams)

Merging clones...
Indexing reads...
Computing coverage...
PROCESSED: Whole cancer genome ./clones/cancer_genome_lr.bam coverage #rname	startpos	endpos	numreads	covbases	coverage	meandepth	meanbaseq	meanmapq
chr21	1	46709983	36067	40071294	85.7874	14.7832	30.4	57



## SV Visualization
Below we generate ```samplot``` illustrations for each SV simulated with ```insilicoSV```.
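
The plotting cell derives a title and a genomic window for each SV from its VCF records. The sketch below isolates that arithmetic under the same conventions (`->` replaced by `-to-` in the title, and a window equal to the breakend span plus 1000 bp). `sv_title` and `plot_window` are hypothetical helper names, and the grammar string `A->aa` is an assumed example value.

```python
def sv_title(svtype, svid, grammar):
    """Build a file-system-friendly title such as 'DUP_INV_sv17_A-to-aa'."""
    return "%s_%s_%s" % (svtype, svid, grammar.replace("->", "-to-"))

def plot_window(breakends, padding=1000):
    """Return (start, end, window_length) covering all breakends of one SV."""
    start = min(breakends)
    end = max(breakends)
    return start, end, end - start + padding

print(sv_title('DUP_INV', 'sv17', 'A->aa'))
print(plot_window({6664646, 6671835}))
```

With the breakends 6664646 and 6671835, the window length is 7189 + 1000 = 8189, the value passed to `-w` in the first samplot command shown in the output.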

In [92]:
samplot_path = root_path + 'samplot/'
os.makedirs(samplot_path, exist_ok=True)

os.environ["MPLBACKEND"] = "Agg"

for vcf_path in vcf_list:
    vcf = VariantFile(vcf_path)
    rec2breakends = defaultdict(set)
    for vcf_rec in vcf.fetch():
        vcf_info = dict(vcf_rec.info)        
        sv_title = "%s_%s_%s" % (vcf_info.get('SVTYPE', 'SNP'), 
                                 vcf_info.get('SVID', vcf_rec.id), 
                                 vcf_info.get('GRAMMAR', '').replace("->", "-to-"))
        rec2breakends[sv_title].add(vcf_rec.start)
        rec2breakends[sv_title].add(vcf_rec.stop)
        if 'TARGET' in vcf_info:
            rec2breakends[sv_title].add(vcf_info['TARGET'])
    vcf.close()
               
    for sv_title, sv_breakends in rec2breakends.items():
        sv_breakends = sorted(sv_breakends)
        output_file = "%s/%s.png" % (samplot_path, sv_title)
        start = min(sv_breakends)
        end = max(sv_breakends)
        wlen = end - start + 1000
        command = ["samplot plot -n Illumina HiFi -b %s %s -s %s -e %s -c chr21 -t %s -w %d" \
              " --include_mqual 0 --separate_mqual 1 -o %s" % (output_name_sr, output_name_lr, 
                                                               start, end, sv_title, wlen, output_file)]
        run_bash_process(command)

['samplot plot -n Illumina HiFi -b ./clones/cancer_genome_sr.bam ./clones/cancer_genome_lr.bam -s 6664646 -e 6671835 -c chr21 -t DUP_INV_sv17_A-to-aa -w 8189 --include_mqual 0 --separate_mqual 1 -o ./clones/samplot//DUP_INV_sv17_A-to-aa.png']
STDERR: /bin/sh: samplot: command not found

['samplot plot -n Illumina HiFi -b ./clones/cancer_genome_sr.bam ./clones/cancer_genome_lr.bam -s 8600331 -e 8609963 -c chr21 -t INV_sv13_A-to-a -w 10632 --include_mqual 0 --separate_mqual 1 -o ./clones/samplot//INV_sv13_A-to-a.png']
STDERR: /bin/sh: samplot: command not found

['samplot plot -n Illumina HiFi -b ./clones/cancer_genome_sr.bam ./clones/cancer_genome_lr.bam -s 13373438 -e 13388315 -c chr21 -t CUSTOM_sv21_AB-to-AbbA -w 15877 --include_mqual 0 --separate_mqual 1 -o ./clones/samplot//CUSTOM_sv21_AB-to-AbbA.png']
STDERR: /bin/sh: samplot: command not found

['samplot plot -n Illumina HiFi -b ./clones/cancer_genome_sr.bam ./clones/cancer_genome_lr.bam -s 13757028 -e 13763976 -c chr21 -t DUP_sv6

In [None]:
%%sh
ls -l clones/samplot/

We visualize some SVs below.

In [None]:
Image(filename=samplot_path + 'DEL_sv0_A-to-.png')

In [None]:
Image(filename=samplot_path + 'DUP_sv5_A-to-AA+.png')

In [None]:
Image(filename=samplot_path + 'INV_sv10_A-to-a.png')

In [None]:
Image(filename=samplot_path + 'Custom_sv15_A-to-aa.png')

In [None]:
Image(filename=samplot_path + 'Custom_sv20_AB-to-AbbA.png')