# Systematic comparison of MendelImpute against Minimac4 and Beagle5

In [8]:
using Revise
using VCFTools
using MendelImpute
using GeneticVariation
using Random
using Suppressor

# Simulate data of various sizes 

## Step 0. Install `msprime`

[msprime download Link](https://msprime.readthedocs.io/en/stable/installation.html).

Some people might need to activate conda environment via `conda config --set auto_activate_base True`. You can turn it off once simulation is done by executing `conda config --set auto_activate_base False`.


## Step 1. Simulate data in terminal

```
python3 msprime_script.py 50000 10000 10000000 2e-8 2e-8 2020 > ./compare_sys/data1.vcf
python3 msprime_script.py 50000 10000 10000000 4e-8 4e-8 2020 > ./compare_sys/data2.vcf
python3 msprime_script.py 50000 10000 100000000 2e-8 2e-8 2020 > ./compare_sys/data3.vcf
python3 msprime_script.py 50000 10000 5000000 2e-8 2e-8 2020 > ./compare_sys/data4.vcf
```

In [30]:
@show nsamples("./compare_sys/data1.vcf")
@show nrecords("./compare_sys/data1.vcf")
@show nsamples("./compare_sys/data2.vcf")
@show nrecords("./compare_sys/data2.vcf")
@show nsamples("./compare_sys/data3.vcf")
@show nrecords("./compare_sys/data3.vcf")
@show nsamples("./compare_sys/data4.vcf")
@show nrecords("./compare_sys/data4.vcf")

nsamples("./compare_sys/data1.vcf") = 25000
nrecords("./compare_sys/data1.vcf") = 90127
nsamples("./compare_sys/data2.vcf") = 25000
nrecords("./compare_sys/data2.vcf") = 182414
nsamples("./compare_sys/data3.vcf") = 25000
nrecords("./compare_sys/data3.vcf") = 909708


909708

###  Arguments: 
+ Number of haplotypes = 40000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 10 million (same as Beagle 5's choice)
+ Rrecombination rate = 2e-8 (default)
+ mutation rate = 2e-8 (default)
+ seed = 2019

## Step 2: Compress files to .gz

In [34]:
# run these in terminal
run(`cat data1.vcf | gzip > data1.vcf.gz`)
run(`cat data2.vcf | gzip > data2.vcf.gz`)
run(`cat data3.vcf | gzip > data3.vcf.gz`)

ProcessFailedException: failed process: Process(`cat ./compare_sys/data3.vcf '|' gzip '>' ./compare_sys/data3.vcf.gz`, ProcessSignaled(2)) [0]


In [32]:
compress_vcf_to_gz("./compare_sys/data1.vcf"); rm("./compare_sys/data1.vcf", force=true)
compress_vcf_to_gz("./compare_sys/data2.vcf"); rm("./compare_sys/data2.vcf", force=true)
compress_vcf_to_gz("./compare_sys/data3.vcf"); rm("./compare_sys/data3.vcf", force=true)

[32mCreating ./compare_sys/data1.vcf.gz...100%|█████████████| Time: 0:14:14[39m
[32mCreating ./compare_sys/data2.vcf.gz...  0%|             |  ETA: 0:30:23[39m

InterruptException: InterruptException:

## Step 3: filter files

+ `haplo_ref.vcf.gz`: haplotype reference files
+ `target.vcf.gz`: complete genotype information
+ `target_masked.vcf.gz`: the same as `target.vcf.gz` except some entries are masked

In [4]:
records, samples = nrecords("full.vcf"), nsamples("full.vcf")
@show records
@show samples;

# compute target and reference index
tgt_index = falses(samples)
tgt_index[samples-999:end] .= true
ref_index = .!tgt_index
record_index = trues(records) # save all records (SNPs) 

# create target.vcf.gz and haplo_ref.vcf.gz
@time VCFTools.filter("full.vcf", record_index, tgt_index, des = "target.vcf.gz")
@time VCFTools.filter("full.vcf", record_index, ref_index, des = "haplo_ref.vcf.gz")

# import full target matrix. Also transpose so that columns are samples. 
@time X = convert_gt(Float32, "target.vcf.gz"; as_minorallele=false)
X = copy(X')

# mask 10% entries
p, n = size(X)
Random.seed!(123)
missingprop = 0.1
X .= ifelse.(rand(Float32, p, n) .< missingprop, missing, X)
masks = ismissing.(X)

# save X to new VCF file
mask_gt("target.vcf.gz", masks, des="target_masked.vcf.gz")

records = 35897
samples = 2000
 70.297829 seconds (397.28 M allocations: 33.310 GiB, 11.51% gc time)
 67.404677 seconds (395.78 M allocations: 33.237 GiB, 11.69% gc time)
 18.210343 seconds (144.39 M allocations: 12.666 GiB, 17.95% gc time)
