# Compare MendelImpute against Minimac4 and Beagle5 on simulated data

In [1]:
using Revise
using VCFTools
using MendelImpute
using GeneticVariation
using Random
using SparseArrays

┌ Info: Precompiling VCFTools [a620830f-fdd7-5ebc-8d26-3621ab35fbfe]
└ @ Base loading.jl:1273
┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1273


# Simulate data

### Step 0. Install `msprime`

See the [msprime installation instructions](https://msprime.readthedocs.io/en/stable/installation.html).

You may need to enable conda's base environment first with `conda config --set auto_activate_base True`. Once the simulation is done, you can turn it off again with `conda config --set auto_activate_base False`.


### Step 1. Simulate data in terminal

```
python3 msprime_script.py 4000 10000 5000000 2e-8 2e-8 2019 > full.vcf
```

Arguments: 
+ Number of haplotypes = 4000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 5 million (Beagle 5's choice was 10 million)
+ Recombination rate = 2e-8 (default)
+ Mutation rate = 2e-8 (default)
+ seed = 2019

### Step 2. Convert simulated haplotypes to reference haplotypes and target genotype files

+ `haplo_ref.vcf.gz`: reference haplotypes (all samples except the last 1000)
+ `target.vcf.gz`: complete genotypes of the target samples (the last 1000 samples)
+ `target_masked.vcf.gz`: the same as `target.vcf.gz`, except that about 10% of the entries are masked as missing

In [6]:
records, samples = nrecords("full.vcf"), nsamples("full.vcf")
@show records
@show samples;

# compute target and reference index
tgt_index = falses(samples)
tgt_index[samples-999:end] .= true
ref_index = .!tgt_index
record_index = trues(records) # save all records (SNPs) 

# create target.vcf.gz and haplo_ref.vcf.gz
@time VCFTools.filter("full.vcf", record_index, tgt_index, des = "target.vcf.gz")
@time VCFTools.filter("full.vcf", record_index, ref_index, des = "haplo_ref.vcf.gz")

# import full target matrix. Also transpose so that columns are samples. 
@time X = convert_gt(Float32, "target.vcf.gz"; as_minorallele=false)
X = copy(X')

# mask 10% entries
p, n = size(X)
Random.seed!(123)
missingprop = 0.1
X .= ifelse.(rand(Float32, p, n) .< missingprop, missing, X)
masks = ismissing.(X)

# save X to new VCF file
mask_gt("target.vcf.gz", masks, des="target_masked.vcf.gz")

records = 35897
samples = 2000
 28.954614 seconds (395.78 M allocations: 33.237 GiB, 10.83% gc time)
 27.976343 seconds (395.78 M allocations: 33.237 GiB, 10.68% gc time)
  4.427550 seconds (72.59 M allocations: 6.247 GiB, 14.59% gc time)
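The masking step can be tried on a toy matrix without any VCF files. The sketch below is only an illustration: `X_toy`, `X_true`, and the toy dimensions are made-up names and values, and random genotypes stand in for the output of `convert_gt`.

```julia
using Random

# Toy stand-in for the SNP × sample genotype matrix (entries 0, 1, or 2).
Random.seed!(123)
p, n = 200, 50
X_toy = convert(Matrix{Union{Missing, Float32}}, rand(Float32[0, 1, 2], p, n))
X_true = copy(X_toy)   # keep an unmasked copy, since masking overwrites X_toy

# Mask each entry independently with probability `missingprop`.
missingprop = 0.1
X_toy .= ifelse.(rand(Float32, p, n) .< missingprop, missing, X_toy)
masks = ismissing.(X_toy)

# The realized masked fraction is close to, but not exactly, 10%.
println("masked fraction = ", sum(masks) / length(masks))

# Entries that were not masked are unchanged.
@assert all(X_toy[.!masks] .== X_true[.!masks])
```

Because the masking overwrites the matrix in place, the original values are only available from a separate copy. In this notebook that role is played by `X_complete`, which Step 3 reads back from `target.vcf.gz`.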


### Step 3. Read data after filtering

In [7]:
X_complete = convert_gt(Float32, "target.vcf.gz"; as_minorallele=false)
Xm = convert_gt(Float32, "target_masked.vcf.gz"; as_minorallele=false)
Xm_original = copy(Xm)

1000×35897 Array{Union{Missing, Float32},2}:
 0.0       0.0       0.0       0.0       …  0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0        missing      missing  0.0        missing
 0.0       0.0        missing  0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0        missing  …  0.0       0.0       0.0     
 0.0        missing  0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           missing  0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0        missing
  missing  1.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0       …  0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 ⋮    

# MendelImpute error

In [20]:
# impute
tgtfile = "target_masked.vcf.gz"
reffile = "haplo_ref.vcf.gz"
outfile = "imputed_target.vcf.gz"
width   = 400
@time hs, ph = phase(tgtfile, reffile, impute=true, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile, as_minorallele=false)
missing_idx    = ismissing.(Xm_original)
total_missing  = sum(missing_idx)
actual_missing_values  = X_complete[missing_idx]  #true values of missing entries
imputed_missing_values = X_mendel[missing_idx] #imputed values of missing entries
error_rate = sum(actual_missing_values .!= imputed_missing_values) / total_missing

 30.918348 seconds (256.69 M allocations: 25.196 GiB, 10.61% gc time)


0.002375038349728364
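The error rate above is the fraction of masked entries whose imputed value differs from the truth. A minimal sketch of that calculation as a reusable function follows; `masked_error_rate` is a hypothetical helper written for illustration, not part of MendelImpute or VCFTools, and the small matrices are made up.

```julia
# Hypothetical helper: fraction of masked entries imputed incorrectly.
# All three matrices must share the same orientation (e.g. samples × SNPs).
function masked_error_rate(truth::AbstractMatrix, imputed::AbstractMatrix,
                           masked::AbstractMatrix{Bool})
    size(truth) == size(imputed) == size(masked) ||
        throw(DimensionMismatch("truth, imputed, and masked must have the same size"))
    n_masked = count(masked)
    n_masked == 0 && return 0.0
    return count(truth[masked] .!= imputed[masked]) / n_masked
end

# Toy example: 3 entries are masked and 1 of them is imputed incorrectly.
truth   = Float32[0 1 2; 1 0 0]
masked  = Bool[1 0 1; 0 1 0]
imputed = Float32[0 1 1; 1 0 0]
rate = masked_error_rate(truth, imputed, masked)
println(rate)
```

The size check is the main reason to wrap the calculation in a function: a truth matrix stored as SNPs × samples and an imputed matrix stored as samples × SNPs would otherwise be easy to mix up.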

# Beagle 5 error

In [10]:
run(`java -jar beagle.28Sep18.793.jar gt=target_masked.vcf.gz ref=haplo_ref.vcf.gz out=beagle.result`)

beagle.28Sep18.793.jar (version 5.0)
Copyright (C) 2014-2018 Brian L. Browning
Enter "java -jar beagle.28Sep18.793.jar" to list command line argument
Start time: 02:36 PM PST on 06 Dec 2019

Command line: java -Xmx3641m -jar beagle.28Sep18.793.jar
  gt=target_masked.vcf.gz
  ref=haplo_ref.vcf.gz
  out=beagle.result

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:       1,000
Study samples:           1,000

Window 1 (1:36-4999683)
Reference markers:      35,897
Study markers:          35,897

Burnin  iteration 1:           52 seconds
Burnin  iteration 2:           32 seconds
Burnin  iteration 3:           13 seconds
Burnin  iteration 4:           22 seconds
Burnin  iteration 5:           26 seconds
Burnin  iteration 6:           26 seconds

Phasing iteration 1:           14 seconds
Phasing iteration 2:           19 seconds
Phasing iteration 3:           15 seconds
Phasing iteration 4:           13 seconds
Phasing iteration 5:           14 seconds
Phasing iteration 6: 

Process(`java -jar beagle.28Sep18.793.jar gt=target_masked.vcf.gz ref=haplo_ref.vcf.gz out=beagle.result`, ProcessExited(0))

In [16]:
# import Beagle 5 result (samples × SNPs, the same orientation as X_complete)
Xb = convert_gt(Float32, "beagle.result.vcf.gz", as_minorallele=false)

# Beagle 5 error rate on the masked entries
missing_idx    = ismissing.(Xm_original)
total_missing  = sum(missing_idx)
actual_missing_values  = X_complete[missing_idx] # true values of missing entries
imputed_missing_values = Xb[missing_idx]         # imputed values of missing entries
error_rate = sum(actual_missing_values .!= imputed_missing_values) / total_missing

0.00022299914642274292

In [18]:
# total number of mismatches against the complete (unmasked) genotypes
total_error = sum(Xb .!= X_complete)

801
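As a rough sanity check, the total mismatch count can be compared with the error rate reported earlier. The sketch below assumes that Beagle leaves the observed genotypes unchanged, so that every mismatch falls on a masked entry, and it approximates the number of masked entries by its expected value rather than the realized `total_missing`.

```julia
records     = 35897   # SNPs reported in Step 2
n_target    = 1000    # target samples
missingprop = 0.1     # masking probability used in Step 2
total_error = 801     # mismatch count reported above

# Expected number of masked entries, about 3.59 million.
expected_missing = missingprop * records * n_target

# Approximate error rate, about 2.23e-4.
approx_rate = total_error / expected_missing
println(approx_rate)
```

The approximate rate is close to the Beagle 5 error rate computed on the masked entries, which is what one would expect if all 801 mismatches occur at masked positions.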