# Data Generation

This notebook documents how to simulate realistic haplotypes using [msprime](https://msprime.readthedocs.io/en/stable/#), then how to process the result using our [VCFTools.jl](https://github.com/OpenMendel/VCFTools.jl) package.

**Note:** For demonstration purposes, we simulated an *extremely small* reference panel. 

In [2]:
# load necessary packages in Julia
using MendelImpute
using VCFTools
using Random
using UnicodePlots

## Step 0. Install `msprime`

See the [msprime installation instructions](https://msprime.readthedocs.io/en/stable/installation.html).

If you use conda, you may need to make the base environment activate automatically via `conda config --set auto_activate_base True`. You can turn this off once the simulation is done by executing `conda config --set auto_activate_base False`.


## Step 1. Simulate phased haplotypes 

In the data folder, execute the following in the terminal:

```
python3 msprime_script.py 5000 10000 5000000 2e-8 2e-8 2020 > full.vcf
```

The arguments, in order, are:
+ Number of haplotypes = 5000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 5 million
+ Recombination rate = 2e-8 (default)
+ Mutation rate = 2e-8 (default)
+ Seed = 2020

The resulting `full.vcf` is a VCF file containing 2500 phased diploid samples (5000 haplotypes), each with 36063 SNPs.

In [3]:
nsamples("./data/full.vcf"), nrecords("./data/full.vcf")

(2500, 36063)
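
The step from 5000 simulated haplotypes to 2500 samples can be illustrated with a small sketch. It assumes that consecutive haplotypes are paired into one diploid sample, which is an assumption about how the VCF was written and is not confirmed by this notebook. The toy matrix `H` and the helper `pair_haplotypes` are hypothetical and are not part of `msprime_script.py` or VCFTools.jl.

```julia
using Random

# Toy haplotype matrix: rows are SNPs, columns are haplotypes (false = ref, true = alt).
# Hypothetical data for illustration only.
Random.seed!(2020)
n_snps, n_haps = 6, 10
H = rand(Bool, n_snps, n_haps)

# Pair consecutive haplotypes (1,2), (3,4), ... into diploid samples and
# return the alternate-allele dosage (0, 1 or 2) for each SNP and sample.
function pair_haplotypes(H::AbstractMatrix{Bool})
    iseven(size(H, 2)) || throw(ArgumentError("need an even number of haplotypes"))
    return H[:, 1:2:end] .+ H[:, 2:2:end]
end

G = pair_haplotypes(H)
size(G)  # (6, 5): 10 haplotypes become 5 diploid samples
```

Under the same pairing assumption, 5000 haplotypes give 5000 / 2 = 2500 samples, which matches the output of `nsamples` above.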

## Step 2. Convert simulated data to reference and target files

Starting with the simulated data `full.vcf`, we use 100 samples as imputation targets and the rest as the reference panel. Filtering is achieved with utilities in [VCFTools.jl](https://github.com/OpenMendel/VCFTools.jl). We randomly choose 5,000 SNPs (`p = 5000` in the code below) with minor allele frequency $> 0.05$ as the typed positions. Note that the data must conform to [MendelImpute's data preparation requirements](https://openmendel.github.io/MendelImpute.jl/dev/man/Phasing+and+Imputation/#Preparing-Target-Data).

In [4]:
# set random seed for reproducibility
Random.seed!(2020)

# path to the simulated data
data = "./data/full.vcf"

# remove SNPs with the same positions, keep all samples, save result into new file
SNPs_to_keep = .!find_duplicate_marker(data) 
VCFTools.filter(data, SNPs_to_keep, 1:nsamples(data), des = "./data/uniqueSNPs.vcf.gz")

# summarize data
total_snps, samples, _, _, _, maf_by_record, _ = gtstats("./data/uniqueSNPs.vcf.gz")

# generate target file with 100 samples and 5k snps with maf>0.05
n = 100
p = 5000
record_idx = falses(total_snps)
large_maf = findall(x -> x > 0.05, maf_by_record)  
Random.shuffle!(large_maf)
record_idx[large_maf[1:p]] .= true
sample_idx = falses(samples)
sample_idx[1:n] .= true
Random.shuffle!(sample_idx)
VCFTools.filter("./data/uniqueSNPs.vcf.gz", record_idx, sample_idx, 
    des = "./data/target.typedOnly.vcf.gz", allow_multiallelic=false)

# unphase and mask 1% entries in target file
masks = falses(p, n)
missingprop = 0.01
for j in 1:n, i in 1:p
    rand() < missingprop && (masks[i, j] = true)
end
mask_gt("./data/target.typedOnly.vcf.gz", masks, 
    des="./data/target.typedOnly.masked.vcf.gz", unphase=true)

# generate target panel with all snps (containing true phase and genotypes)
VCFTools.filter("./data/uniqueSNPs.vcf.gz", 1:total_snps, 
    sample_idx, des = "./data/target.full.vcf.gz", allow_multiallelic=false)

# generate reference panel
VCFTools.filter("./data/uniqueSNPs.vcf.gz", 1:total_snps, .!sample_idx, 
    des = "./data/ref.excludeTarget.vcf.gz", allow_multiallelic=false)

finding duplicate markers...100%|███████████████████████| Time: 0:00:21
filtering vcf file...100%|██████████████████████████████| Time: 0:00:27
Progress: 100%|█████████████████████████████████████████| Time: 0:00:23
filtering vcf file...100%|██████████████████████████████| Time: 0:00:22
filtering vcf file...100%|██████████████████████████████| Time: 0:00:23
filtering vcf file...100%|██████████████████████████████| Time: 0:00:44
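
To make the index bookkeeping in this step easier to follow, here is a minimal sketch of the same split on a toy in-memory matrix, using only Julia's standard library. It is not how VCFTools.jl works internally: the matrix `H`, the panel sizes, and the assumption that consecutive haplotype columns form one diploid sample are all hypothetical choices made for illustration.

```julia
using Random

Random.seed!(2020)

# Hypothetical toy panel: 200 SNPs (rows) by 40 haplotypes (columns).
n_snps, n_haps = 200, 40
H = rand(n_snps, n_haps) .< rand(n_snps)   # each SNP gets its own allele frequency

# Minor allele frequency of each SNP.
alt_freq = vec(sum(H, dims = 2)) ./ n_haps
maf = min.(alt_freq, 1 .- alt_freq)

# Randomly pick p typed SNPs among those with MAF > 0.05.
p = 20
candidates = findall(>(0.05), maf)
shuffle!(candidates)
typed = falses(n_snps)
typed[candidates[1:p]] .= true

# Treat consecutive haplotype pairs as diploid samples and hold out n of them as targets.
n_samples = n_haps ÷ 2
n = 5
is_target = falses(n_samples)
is_target[1:n] .= true
shuffle!(is_target)

# Unphased genotypes (allele dosages) of the target samples at the typed SNPs.
dosage = H[:, 1:2:end] .+ H[:, 2:2:end]
target = Matrix{Union{Missing, Int}}(dosage[typed, is_target])

# Mask roughly 1% of the target entries.
missingprop = 0.01
masks = rand(p, n) .< missingprop
target[masks] .= missing

# Everything outside the target samples forms the reference panel.
ref_cols = repeat(.!is_target, inner = 2)   # expand the sample mask to a haplotype mask
reference = H[:, ref_cols]
```

In this sketch `target` plays the role of `target.typedOnly.masked.vcf.gz` (typed SNPs only, unphased, with missing entries) and `reference` plays the role of `ref.excludeTarget.vcf.gz` (all SNPs, phased, target samples excluded).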


## Step 3. Generate `.jlso` compressed reference panel

MendelImpute requires one to pre-process the reference panel for faster reading. This is achieved via the [compress_haplotypes](https://openmendel.github.io/MendelImpute.jl/dev/man/api/#MendelImpute.compress_haplotypes) function.

In [5]:
reffile = "./data/ref.excludeTarget.vcf.gz"
tgtfile = "./data/target.typedOnly.masked.vcf.gz"
outfile = "./data/ref.excludeTarget.jlso"
@time compress_haplotypes(reffile, tgtfile, outfile)

importing reference data...100%|████████████████████████| Time: 0:00:11


 26.798756 seconds (205.06 M allocations: 14.641 GiB, 5.74% gc time)


## Output explanation

You just generated the reference and target files used for imputation:

+ `ref.excludeTarget.jlso`: Compressed reference haplotype panel with 4800 haplotypes (in JLSO format)
+ `target.typedOnly.masked.vcf.gz`: Imputation target file containing 100 samples at 5k SNPs. All genotypes are unphased and contain 1% missing data.

You also generated/downloaded:

+ `full.vcf`: The original simulated data from `msprime`.
+ `uniqueSNPs.vcf.gz`: The original data with duplicate records (SNPs sharing the same position) removed.
+ `ref.excludeTarget.vcf.gz`: Reference haplotype panel (in VCF format)
+ `target.full.vcf.gz`: The complete data for the imputation targets, used for checking imputation accuracy. All genotypes are phased and non-missing.
+ `target.typedOnly.vcf.gz`: Complete target data on just the typed SNPs. All genotypes are phased and non-missing. This file is only a by-product of generating the other files and is not used downstream.

## Statistics on compressed reference panel

`MendelImpute` contains some unexported utility functions to quickly summarize a `.jlso` compressed haplotype reference panel. For instance,

In [6]:
# calculate number of unique haplotypes per window
haps_per_window = MendelImpute.count_haplotypes_per_window("./data/ref.excludeTarget.jlso")

# calculate window width
window_width = MendelImpute.get_window_widths("./data/ref.excludeTarget.jlso");

In [7]:
histogram(haps_per_window)

                  ┌                                        ┐
   [600.0, 650.0) ┤▇▇▇▇▇ 1
   [650.0, 700.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4
   [700.0, 750.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8
   [750.0, 800.0) ┤▇▇▇▇▇▇▇▇▇ 2
   [800.0, 850.0) ┤▇▇▇▇▇ 1
                  └                                        ┘
                                  Frequency

In [8]:
histogram(window_width)

                  ┌                                        ┐
   [625.0, 626.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 16
                  └                                        ┘
                                  Frequency

**Conclusion:** The compressed reference panel contains 16 windows of 625 typed SNPs each. Each window contains roughly 600 to 850 unique haplotypes.
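
To make the "unique haplotypes per window" statistic concrete, the following sketch counts distinct haplotypes in fixed-width windows of a toy matrix. It only illustrates the idea and makes no claim about how `MendelImpute.count_haplotypes_per_window` is implemented; the helper `unique_haplotypes_per_window`, the matrix `H`, and the fixed window width are hypothetical.

```julia
using Random

Random.seed!(2020)

# Hypothetical toy reference panel: 12 typed SNPs (rows) by 8 haplotypes (columns).
H = rand(Bool, 12, 8)

# Count distinct haplotypes (columns) within consecutive windows of `width` SNPs.
# The last window may be shorter if `width` does not divide the number of SNPs.
function unique_haplotypes_per_window(H::AbstractMatrix{Bool}, width::Int)
    n_snps = size(H, 1)
    counts = Int[]
    for start in 1:width:n_snps
        stop = min(start + width - 1, n_snps)
        window = H[start:stop, :]
        # Two haplotypes are the same within a window if their columns are identical.
        push!(counts, size(unique(window, dims = 2), 2))
    end
    return counts
end

counts = unique_haplotypes_per_window(H, 4)
length(counts)  # 3 windows of 4 SNPs each
```

Each count is at most the number of haplotypes in the panel, so a window with far fewer unique haplotypes than the panel size indicates that many reference haplotypes are identical within that window.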