# SV Simulation Tutorial
This notebook demonstrates the process of creating a synthetic genome from an input reference, and then simulating and aligning reads and inspecting the results. Below we provide an example configuration file specifying the SVs to be included in the synthetic output genome and insert those SVs into chromosome 21 of GRCh38. We provide example calls to DWGSIM and PBSIM3 to generate synthetic paired-end short reads and HiFi long reads, but this procedure is generalizable to any read simulator. Lastly, we provide visualization code to view the size distribution of the simulated SVs as well as the pileup images of the reads in the impacted regions of the genome.

Before running this notebook, please install the environment as described in the `docs/demo_notebook.md` readme.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pysam import VariantFile
from intervaltree import Interval, IntervalTree
from IPython.display import Image

## Synthetic reference simulation
`demo_config.yaml` gives the composition of the set of SVs to be inserted into the reference. Here we've included 5 examples each of the following types: deletion (DEL), tandem duplication (DUP), inversion (INV), and dispersed duplication (dDUP).

In [None]:
%%sh
cat ./configs/demo_config.yaml

Downloading chr21 reference

In [None]:
%%sh
mkdir -p chr21_ref

wget -O chr21_ref/chr21.fa.gz https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz
gunzip -f chr21_ref/chr21.fa.gz
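
Before simulating, it can be worth confirming that the download decompressed correctly. The following is a minimal sketch using only the Python standard library; the helper name `fasta_stats` is hypothetical and not part of insilicoSV. It reports, for each contig in a plain-text FASTA file, the sequence length and the number of `N` bases.

```python
from collections import OrderedDict


def fasta_stats(path):
    """Return {contig_name: (length, n_count)} for a plain-text FASTA file."""
    stats = OrderedDict()
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                # the contig name is the first whitespace-delimited token of the header
                tokens = line[1:].split()
                name = tokens[0] if tokens else ''
                stats[name] = (0, 0)
            elif name is not None:
                length, n_count = stats[name]
                stats[name] = (length + len(line), n_count + line.upper().count('N'))
    return stats
```

Calling `fasta_stats('chr21_ref/chr21.fa')` should return a single entry keyed by the contig name in the file's header line.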

Instantiating and populating the output directory (insilicosv places all output files in the directory containing the config, so we copy the demo config there).

In [None]:
%%sh
mkdir -p output
cp ./configs/demo_config.yaml ./output/.

`insilicosv` is called with the above config as input, and a random seed is set for the simulation (an optional input).

In [None]:
%%sh
insilicosv -c ./output/demo_config.yaml

Below we show the VCF generated by the simulation.

In [None]:
%%sh
cat output/sim.vcf
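
For a quick summary without scrolling through the full file, the records can be tallied by type. This is a minimal sketch using only the standard library; `count_svtypes` is a hypothetical helper, and it assumes each record carries an `SVTYPE` key in its INFO column, as the parsing function later in this notebook also assumes.

```python
from collections import Counter


def count_svtypes(vcf_lines):
    """Tally VCF records by the SVTYPE value in the INFO column (8th field)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith('#') or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 8:
            continue  # not a complete VCF record
        # INFO is a ';'-separated list of KEY=VALUE pairs (flags without '=' are ignored)
        info = dict(item.split('=', 1) for item in fields[7].split(';') if '=' in item)
        counts[info.get('SVTYPE', 'unknown')] += 1
    return counts
```

A possible use is `with open('output/sim.vcf') as fh: print(count_svtypes(fh))`.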

## Read simulation
### Short-read simulation
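Below we combine the two haplotype FASTA sequences into a single file and generate synthetic paired-end short reads. The `-C` value (default 5, overridable through the `INSILICOSV_DEMO_COVERAGE_SHORT` environment variable) is the mean coverage over the combined two-haplotype file, which corresponds to roughly 10x coverage once the reads are aligned to the single chr21 reference. Definitions for some of the DWGSIM input parameters are given below: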
```
         -C FLOAT      mean coverage across available positions (-1 to disable) [100.00]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -y FLOAT      probability of a random DNA read [0.05]
         -S INT        generate reads [0]:
                           0: default (opposite strand for Illumina, same strand for SOLiD/Ion Torrent)
                           1: same strand (mate pair)
                           2: opposite strand (paired end)
```

In [None]:
%%sh
mv output/sim.hapA.fa output/sim.fa
cat output/sim.hapB.fa >> output/sim.fa
rm output/sim.hapB.fa

COVERAGE="${INSILICOSV_DEMO_COVERAGE_SHORT:-5}"
READ_LEN=151
dwgsim -C $COVERAGE -1 $READ_LEN -2 $READ_LEN -y 0 -S 0 -c 0 -o 1 -m /dev/null -H output/sim.fa output/sim_sr.dwgsim
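
As a rough sanity check on the number of reads to expect, mean coverage relates to the read-pair count by coverage ≈ pairs × 2 × read length / total sequence length. The sketch below is a back-of-the-envelope estimate, not DWGSIM's internal calculation; `expected_read_pairs` is a hypothetical helper.

```python
def expected_read_pairs(total_bases, coverage, read_len):
    """Approximate number of read pairs needed for a given mean coverage.

    total_bases: combined length of all sequences in the simulated FASTA
    coverage:    mean coverage requested (e.g. the value passed to -C)
    read_len:    length of each read in a pair
    """
    if read_len <= 0:
        raise ValueError('read_len must be positive')
    return round(coverage * total_bases / (2 * read_len))
```

For example, assuming two haplotypes of about 46.7 Mb each (the approximate length of chr21 in GRCh38), `expected_read_pairs(2 * 46_700_000, 5, 151)` gives a figure on the order of 1.5 million pairs. The actual count may differ, since the simulated haplotypes change length with the inserted SVs and the estimate ignores how the simulator treats N regions.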

After generating the reads, we align them to the chr21 reference with BWA-MEM.

In [None]:
%%sh
bwa index chr21_ref/chr21.fa
bwa mem chr21_ref/chr21.fa output/sim_sr.dwgsim.bwa.read1.fastq.gz output/sim_sr.dwgsim.bwa.read2.fastq.gz | samtools view -Sb - > output/sim_sr.bwamem.bam

After alignment, we sort and index the .bam file.

In [None]:
%%sh
samtools sort output/sim_sr.bwamem.bam > output/sim_sr.bwamem.sorted.bam

In [None]:
%%sh
rm output/sim_sr.bwamem.bam
samtools index output/sim_sr.bwamem.sorted.bam

### Long-read simulation
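The command below uses PBSIM3 to simulate HiFi reads from the synthetic genome. As with the short reads, the default `--depth` of 5 (overridable through `INSILICOSV_DEMO_COVERAGE_LONG`) applies to each of the two haplotype sequences, giving roughly 10x coverage against the chr21 reference. PBSIM3 writes a separate set of reads for each reference contig, so in this case we combine the reads from the two synthetic haplotypes and create a .bam (with minimap2), which we then sort and index.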

In [None]:
%%sh
COVERAGE="${INSILICOSV_DEMO_COVERAGE_LONG:-5}"
pbsim --strategy wgs --method qshmm --qshmm $CONDA_PREFIX/data/QSHMM-RSII.model --depth $COVERAGE --accuracy-mean 0.999 --accuracy-min 0.99 --length-min 18000 --length-mean 20000 --length-max 22000 --genome output/sim.fa --prefix output/sim_lr

In [None]:
%%sh
cat output/sim_lr_*.fastq > output/sim_lr.fastq
rm output/sim_lr_*.fastq
rm output/*.maf
rm output/*.ref
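
A quick check that the merged FASTQ holds reads in the requested length range (18,000 to 22,000 bp in the PBSIM3 call above) might look like the following. It is a minimal standard-library sketch that assumes the conventional four-line FASTQ layout; `fastq_length_summary` is a hypothetical helper.

```python
def fastq_length_summary(path):
    """Return (read_count, min_len, max_len, mean_len) for a 4-line-per-record FASTQ."""
    count, total, shortest, longest = 0, 0, None, 0
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 != 1:
                continue  # only the second line of each record holds the sequence
            n = len(line.strip())
            count += 1
            total += n
            shortest = n if shortest is None else min(shortest, n)
            longest = max(longest, n)
    if count == 0:
        return (0, 0, 0, 0.0)
    return (count, shortest, longest, total / count)
```

Running `fastq_length_summary('output/sim_lr.fastq')` after the merge step gives the read count along with the minimum, maximum and mean read length.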

In [None]:
%%sh
minimap2 -ax map-hifi chr21_ref/chr21.fa output/sim_lr.fastq | samtools view -Sb > output/sim_lr.minimap2.bam

In [None]:
%%sh
samtools sort output/sim_lr.minimap2.bam > output/sim_lr.minimap2.sorted.bam

In [None]:
%%sh
rm output/sim_lr.minimap2.bam
samtools index output/sim_lr.minimap2.sorted.bam

------

## SV Size Visualization
Below we provide a simple parsing function to convert the VCF generated above into a dataframe that can easily be plotted or fed to other post-hoc analyses.

In [None]:
def insilico_bench_to_df(input_vcf):
    vcf = VariantFile(input_vcf)
    callset_info = {'chrom': [], 'start': [], 'end': [], 'component': [], 'length': [], 'type': [], 'parent_type': [], 'context': []}
    for rec in vcf.fetch():
        parent_svtype = rec.info['PARENT_SVTYPE'] if 'PARENT_SVTYPE' in rec.info else rec.info['SVTYPE']
        callset_info['chrom'].append(rec.chrom)
        callset_info['start'].append(rec.start)
        callset_info['end'].append(rec.stop)
        callset_info['parent_type'].append(parent_svtype)
        callset_info['type'].append(rec.info['SVTYPE'])
        callset_info['component'].append('source')
        callset_info['length'].append(rec.info['SVLEN'])
        callset_info['context'].append('None' if 'OVERLAP_EV' not in rec.info else rec.info['OVERLAP_EV'])
        if 'TARGET' in rec.info and rec.info['TARGET'] > rec.stop + 1:
            # add a second row describing the dispersion, i.e. the gap between the source and the target
            disp_interval = (rec.stop, rec.info['TARGET']) if rec.info['TARGET'] > rec.stop else (rec.info['TARGET'], rec.start)
            callset_info['chrom'].append(rec.chrom)
            callset_info['start'].append(disp_interval[0])
            callset_info['end'].append(disp_interval[1])
            callset_info['parent_type'].append(parent_svtype)
            callset_info['type'].append(rec.info['SVTYPE'] + '_disp')
            callset_info['component'].append('dispersion')
            callset_info['length'].append(disp_interval[1] - disp_interval[0])
            callset_info['context'].append('None')
    vcf_df = pd.DataFrame(callset_info)
    return vcf_df
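
The dispersion interval computed inside the parser is the gap between the source interval and the `TARGET` position. The standalone sketch below illustrates that calculation for targets on either side of the source; `dispersion_interval` is a hypothetical helper, written under the assumption that `TARGET` is a single integer coordinate on the same chromosome as the source.

```python
def dispersion_interval(start, stop, target):
    """Return the (begin, end) gap between a source interval and its target.

    start, stop: source interval (as rec.start and rec.stop in pysam)
    target:      position at which the copied sequence is inserted
    Returns None when the target lies within the source (start <= target <= stop).
    """
    if target > stop:
        return (stop, target)   # target downstream of the source
    if target < start:
        return (target, start)  # target upstream of the source
    return None
```

The dispersion length is then simply `end - begin` of the returned tuple, which is the value stored in the `length` column for rows with `component == 'dispersion'`.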

In [None]:
sim_df = insilico_bench_to_df('output/sim.vcf')

In [None]:
sim_df

In [None]:
def plot_simulation(df):
    violin_color_palette = {'source': '#85C1E9', 'dispersion': '#EB984E'}
    # set the style before creating the figure so that it applies to the new axes
    sns.set_style('ticks')
    f, ax = plt.subplots(figsize=(10,5))
    sns.violinplot(data=df, x='parent_type', y='length', hue='component', split=False, order=['DEL', 'DUP', 'INV', 'dDUP'], hue_order=['source', 'dispersion'], palette=violin_color_palette)
    sns.despine(offset=10, trim=True)
    ax.set(xlabel='', ylabel='Length')

In [None]:
plot_simulation(sim_df)

## IGV Visualization
Below we provide infrastructure for the automatic visualization of insilicoSV output in IGV. The IGV batch script generated by this function can be run in desktop or command-line IGV; in this example we call command-line IGV.

In [None]:
%%sh
mkdir -p IGV_screenshots

In [None]:
def is_contained(query_interval, interval):
    # helper function to check whether the query interval is contained in the given interval
    # (the chromosome name is stored in .data, so both intervals must be on the same chromosome)
    return interval.begin <= query_interval.begin and\
           interval.end >= query_interval.end and\
           interval.data == query_interval.data


def generate_script(input_vcfs, bam_paths, vcf_paths, output_batch_script_path, igv_screenshot_dir, min_svlen,
                    max_svlen, genome='hg38', colorby_ins_size=True, groupby_pair_orientation=True,
                    viewaspairs=True, skip_duplicates=True):
    """
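    Write an IGV batch script that takes one screenshot per record in input_vcfs.
    bam_paths and vcf_paths are loaded as IGV tracks; records whose (target-extended)
    length falls outside [min_svlen, max_svlen] are skipped, as are records already
    covered by an earlier screenshot when skip_duplicates is True.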
    """
    out = open(output_batch_script_path, 'w')
    out.write('new\n')
    out.write(f'genome {genome}\n')
    out.write('preference SAM.MAX_VISIBLE_RANGE 1000\n')
    out.write('preference SAM.SHOW_MISMATCHES FALSE\n')
    for bam_path in bam_paths:
        out.write(f'load {bam_path}\n')
    for vcf_path in vcf_paths:
        out.write(f'load {vcf_path}\n')

    # maintaining an interval tree of the genome intervals that have been screenshot in case
    # there are multiple input records that will have been captured by a single screenshot
    screenshot_intervals = IntervalTree()

    write_first_time = True
    for input_vcf in input_vcfs:
        input_vcf_file = VariantFile(input_vcf)
        for rec in input_vcf_file.fetch():
            if skip_duplicates and any([is_contained(Interval(rec.start, rec.stop, rec.chrom), ivl)
                                        for ivl in list(screenshot_intervals.overlap(rec.start, rec.stop))]):
                continue
            start, end = rec.start, rec.stop
            if 'TARGET' in rec.info:
                if rec.info['TARGET'] > end:
                    end = rec.info['TARGET']
                else:
                    start = rec.info['TARGET']
            svlen = end - start
            if min_svlen <= svlen <= max_svlen:
                # setting the margin on either side of the interval to svlen/10
                screenshot_margin = svlen // 10
                start_pos = str(start - screenshot_margin)
                end_pos = str(end + screenshot_margin)
                svtype = rec.info['SVTYPE']
                out.write('goto ' + rec.chrom + ':' + start_pos + '-' + end_pos + '\n')
                screenshot_intervals[int(start_pos):int(end_pos)] = rec.chrom
                if write_first_time:
                    if colorby_ins_size:
                        out.write('colorby INSERT_SIZE\n')
                    if groupby_pair_orientation:
                        out.write('group PAIR_ORIENTATION\n')
                    if viewaspairs:
                        out.write('viewaspairs\n')
                    out.write('maxPanelHeight 1000\n')
                    out.write('snapshotDirectory ' + igv_screenshot_dir + '\n')
                    write_first_time = False
                out.write('collapse\n')
                out.write(f'snapshot {rec.chrom}_{rec.start}_{rec.stop}_{svtype}.png\n')
    out.close()

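# input_vcfs are opened by Python from the notebook's working directory;
# bam_paths and vcf_paths are written into the batch script as-is for IGV to load
generate_script(input_vcfs=['output/sim.vcf'], bam_paths=['../output/sim_sr.bwamem.sorted.bam', '../output/sim_lr.minimap2.sorted.bam'],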
                vcf_paths=['../output/sim.vcf'], output_batch_script_path='IGV_screenshots/IGV_batch_script.txt',
                igv_screenshot_dir='./IGV_screenshots', min_svlen=0, max_svlen=250000)
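
To make the windowing and duplicate-skipping logic easier to follow in isolation, here is a minimal sketch that reproduces it with a plain list instead of an `IntervalTree`. The names `extended_interval` and `plan_screenshots` are hypothetical, and records are represented as simple `(chrom, start, stop, target)` tuples rather than pysam records.

```python
def extended_interval(start, stop, target=None):
    """Stretch the source interval so that it also covers the target position."""
    if target is not None:
        if target > stop:
            stop = target
        else:
            start = target
    return start, stop


def plan_screenshots(records, min_svlen, max_svlen):
    """Return the (chrom, begin, end) windows to capture, in input order."""
    windows = []  # every window planned so far
    for chrom, start, stop, target in records:
        # skip records that already sit fully inside an earlier window
        if any(c == chrom and b <= start and e >= stop for c, b, e in windows):
            continue
        begin, end = extended_interval(start, stop, target)
        svlen = end - begin
        if min_svlen <= svlen <= max_svlen:
            margin = svlen // 10  # pad each side by a tenth of the event length
            windows.append((chrom, begin - margin, end + margin))
    return windows
```

The linear scan over `windows` is quadratic in the number of records, which is fine for a demo with a few dozen SVs; the interval tree used in `generate_script` is the better choice for large call sets.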

The batch script is populated with some initializing parameters regarding alignment visualization, but it can be edited according to the IGV batch command [documentation](https://github.com/igvteam/igv/wiki/Batch-commands).

In [None]:
%%sh
cat IGV_screenshots/IGV_batch_script.txt | head -n 20

The call to IGV below populates the `IGV_screenshots/` directory with a `.png` image for each record in our input VCF that passes the length filter, a subset of which we visualize afterwards.

In [None]:
%%sh
igv -b IGV_screenshots/IGV_batch_script.txt

In [None]:
Image(filename='IGV_screenshots/chr21_23841813_23842113_DEL.png')

In [None]:
Image(filename='IGV_screenshots/chr21_25464823_25465098_DUP.png')

In [None]:
Image(filename='IGV_screenshots/chr21_14269125_14269334_INV.png')

In [None]:
Image(filename='IGV_screenshots/chr21_37672777_37673392_dDUP.png')