# Compare MendelImpute against Minimac4 and Beagle5 on simulated data

In [1]:
using Revise
using VCFTools
using MendelImpute
using GeneticVariation
using Random
using SparseArrays
using JLD2, FileIO, JLSO
using ProgressMeter
using GroupSlices
# using Plots
# using ProfileView

┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1273


# Simulate data

### Step 0. Install `msprime`

[msprime installation instructions](https://msprime.readthedocs.io/en/stable/installation.html).

You may need to activate the conda base environment first via `conda config --set auto_activate_base True`. Once the simulation is done, you can turn this off again with `conda config --set auto_activate_base False`.


### Step 1. Simulate data in terminal

```
python3 msprime_script.py 4000 10000 5000000 2e-8 2e-8 2019 > full.vcf
```

Arguments (in order):
+ Number of haplotypes = 4000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 5 million base pairs
+ Recombination rate = 2e-8 (default)
+ Mutation rate = 2e-8 (default)
+ Seed = 2019
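
The positional arguments can be made explicit by building the command from named values. This is only a sketch: the variable names are illustrative, the argument order is assumed to match the list above, and the contents of `msprime_script.py` are not shown here.

```julia
# Hypothetical names for the six positional arguments of msprime_script.py;
# the order is assumed to match the argument list above.
n_haplotypes  = 4000
Ne            = 10000
seq_length    = 5000000
recomb_rate   = "2e-8"   # strings, so the text handed to Python is unchanged
mutation_rate = "2e-8"
seed          = 2019

cmd = `python3 msprime_script.py $n_haplotypes $Ne $seq_length $recomb_rate $mutation_rate $seed`
println(cmd)

# To run the simulation and write the VCF (requires python3 and msprime):
# run(pipeline(cmd, stdout = "full.vcf"))
```

Building the `Cmd` object does not execute anything; only the commented-out `run` call would start the simulation.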

### Step 2: Convert simulated haplotypes to reference haplotypes and target genotype files

+ `haplo_ref.vcf.gz`: reference haplotype panel
+ `target.vcf.gz`: complete genotype information for the target samples
+ `target_masked.vcf.gz`: the same as `target.vcf.gz` except that some entries are masked

In [4]:
records, samples = nrecords("full.vcf"), nsamples("full.vcf")
@show records
@show samples;

# compute target and reference index
tgt_index = falses(samples)
tgt_index[samples-999:end] .= true
ref_index = .!tgt_index
record_index = trues(records) # save all records (SNPs) 

# create target.vcf.gz and haplo_ref.vcf.gz
@time VCFTools.filter("full.vcf", record_index, tgt_index, des = "target.vcf.gz")
@time VCFTools.filter("full.vcf", record_index, ref_index, des = "haplo_ref.vcf.gz")

# import full target matrix. Also transpose so that columns are samples. 
@time X = convert_gt(Float32, "target.vcf.gz"; as_minorallele=false)
X = copy(X')

# mask 10% of entries
p, n = size(X)
Random.seed!(123)
missingprop = 0.1
X .= ifelse.(rand(Float32, p, n) .< missingprop, missing, X)
masks = ismissing.(X)

# save X to new VCF file
mask_gt("target.vcf.gz", masks, des="target_masked.vcf.gz")

records = 35897
samples = 2000
 70.297829 seconds (397.28 M allocations: 33.310 GiB, 11.51% gc time)
 67.404677 seconds (395.78 M allocations: 33.237 GiB, 11.69% gc time)
 18.210343 seconds (144.39 M allocations: 12.666 GiB, 17.95% gc time)
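
The masking step can be tried on a toy matrix without any VCF files. The sketch below uses made-up dimensions and random genotypes, and it assumes the genotype matrix has an element type that allows `missing` (as the in-place assignment in the cell above requires).

```julia
using Random

# Toy genotype matrix (SNPs × samples) with entries in {0, 1, 2}; the element
# type is widened so that entries can be set to `missing`.
Random.seed!(123)
p, n = 200, 50
X_toy = convert(Matrix{Union{Missing, Float32}}, rand(Float32[0, 1, 2], p, n))

# Mask each entry independently with probability `missingprop`.
missingprop = 0.1
X_toy .= ifelse.(rand(Float32, p, n) .< missingprop, missing, X_toy)
masks_toy = ismissing.(X_toy)

# The realized proportion is close to, but generally not exactly, 10%.
println("masked proportion = ", count(masks_toy) / (p * n))
```

Because each entry is masked independently, the number of masked genotypes varies from sample to sample and from SNP to SNP.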


# Try compressing haplotype ref panels

In [5]:
# compress as jld2
vcffile = "haplo_ref.vcf.gz"
outfile = "haplo_ref.jld2"
width = 500
@time compress_haplotypes(vcffile, outfile, width);

importing vcf data...100%|██████████████████████████████| Time: 0:00:07


  9.137770 seconds (73.80 M allocations: 5.445 GiB, 6.93% gc time)


In [10]:
# compress to jlso manually; calling compress_haplotypes() directly would first require generating a manifest file
vcffile = "haplo_ref.vcf.gz"
outfile = "haplo_ref.jlso"
width = 500
trans = true
dims = 2
flankwidth = 0
H, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt = convert_ht(Bool, vcffile, trans=trans, save_snp_info=true, msg="importing vcf data...")
snps = (dims == 2 ? size(H, 1) : size(H, 2))
windows = floor(Int, snps / width)
compressed_Hunique = MendelImpute.CompressedHaplotypes(windows, width, snps, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt)
for w in 1:windows
    if w == 1
        cur_range = 1:(width + flankwidth)
    elseif w == windows
        cur_range = ((windows - 1) * width - flankwidth + 1):snps
    else
        cur_range = ((w - 1) * width - flankwidth + 1):(w * width + flankwidth)
    end
    compressed_Hunique.CWrange[w] = cur_range

    H_cur_window = (dims == 2 ? view(H, cur_range, :) : view(H, :, cur_range))
    hapmap = groupslices(H_cur_window, dims=dims)
    unique_idx = unique(hapmap)
    complete_to_unique = indexin(hapmap, unique_idx)
    uniqueH = (dims == 2 ? H_cur_window[:, unique_idx] : H_cur_window[unique_idx, :])
    compressed_Hunique[w] = MendelImpute.CompressedWindow(unique_idx, hapmap, complete_to_unique, uniqueH)
end
JLSO.save(outfile, :compressed_Hunique => compressed_Hunique, format=:julia_serialize, compression=:gzip)

importing vcf data...100%|██████████████████████████████| Time: 0:00:06
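
The two ideas in the loop above, partitioning SNPs into windows and keeping only the distinct haplotypes per window, can be checked in isolation. The sketch below is not MendelImpute's implementation: `window_ranges` is a hypothetical helper that copies the range arithmetic from the cell, and the deduplication uses Base's `unique` on a toy matrix rather than `groupslices`.

```julia
# Window ranges as in the loop above: the last window absorbs the leftover SNPs.
function window_ranges(snps::Int, width::Int, flankwidth::Int = 0)
    windows = floor(Int, snps / width)
    ranges = Vector{UnitRange{Int}}(undef, windows)
    for w in 1:windows
        if w == 1
            ranges[w] = 1:(width + flankwidth)
        elseif w == windows
            ranges[w] = ((windows - 1) * width - flankwidth + 1):snps
        else
            ranges[w] = ((w - 1) * width - flankwidth + 1):(w * width + flankwidth)
        end
    end
    return ranges
end

ranges = window_ranges(35897, 500)
println(length(ranges), " windows; last window = ", last(ranges))

# Deduplicate haplotypes (columns) in one toy window with Base functions only.
H_window = Bool[1 0 1 1;
                0 1 0 0;
                1 1 1 1]               # columns 1, 3 and 4 are identical
uniqueH = unique(H_window, dims = 2)   # one copy of each distinct column
# For every original column, the index of its copy in `uniqueH`.
complete_to_unique = [findfirst(j -> uniqueH[:, j] == H_window[:, i], 1:size(uniqueH, 2))
                      for i in 1:size(H_window, 2)]
println(complete_to_unique)
```

With the 35,897 SNPs of this data set and `width = 500`, the partition has 71 windows, and with `flankwidth = 0` the windows tile the SNPs without overlap.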


In [12]:
# load jld2
@time @load "haplo_ref.jld2" compressed_Hunique;

  0.069700 seconds (588.27 k allocations: 44.810 MiB)


In [16]:
# load jlso
@time loaded = JLSO.load("haplo_ref.jlso")
compressed_Hunique = loaded[:compressed_Hunique];

  0.157421 seconds (505.38 k allocations: 26.817 MiB)


In [17]:
;ls -al haplo_ref.jld2

-rw-r--r--  1 biona001  staff  15279595 Jun 10 14:59 haplo_ref.jld2


In [19]:
;ls -al haplo_ref.jlso

-rw-r--r--  1 biona001  staff  920022 Jun 10 15:03 haplo_ref.jlso


In [18]:
;ls -al haplo_ref.vcf.gz

-rw-r--r--@ 1 biona001  staff  5449864 Apr  5 19:59 haplo_ref.vcf.gz
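
The three listings can be put side by side with a little arithmetic. The byte counts below are copied from the `ls -al` output above; nothing is read from disk.

```julia
# File sizes in bytes, copied from the `ls -al` listings above.
sizes = Dict(
    "haplo_ref.jld2"   => 15279595,
    "haplo_ref.jlso"   => 920022,
    "haplo_ref.vcf.gz" => 5449864,
)

jlso = sizes["haplo_ref.jlso"]
for (name, bytes) in sort(collect(sizes), by = last)
    println(rpad(name, 18), lpad(bytes, 10), " bytes  (",
            round(bytes / jlso, digits = 1), "x the jlso file)")
end
```

On these numbers the gzip-compressed jlso file is the smallest of the three, while the timings above show it taking about 0.16 seconds to load versus about 0.07 seconds for jld2.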


# MendelImpute error rate

In [3]:
# search only single breakpoints (1 thread)
Random.seed!(2020)
tgtfile = "target_masked.vcf.gz"
reffile = "haplo_ref.jlso"
outfile = "imputed_target.vcf.gz"
width   = 500
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "target.vcf.gz")
n, p = size(X_mendel)
error_rate = sum(X_mendel .!= X_complete) / n / p

Importing genotype file...100%|█████████████████████████| Time: 0:00:06


avg_haps = 320.32394366197184


Merging breakpoints...100%|█████████████████████████████| Time: 0:00:10


[0m[1m ──────────────────────────────────────────────────────────────────────────────[22m
[0m[1m                               [22m        Time                   Allocations      
                               ──────────────────────   ───────────────────────
       Tot / % measured:            24.7s / 100%            7.41GiB / 100%     

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 phasing (haplotypi...      1    10.0s  40.6%   10.0s   61.8MiB  0.81%  61.8MiB
 Import genotype data       1    7.11s  28.8%   7.11s   5.44GiB  73.5%  5.44GiB
 Compute redundant ...      1    4.56s  18.5%   4.56s   1.67GiB  22.6%  1.67GiB
   compute optimal ...     71    3.37s  13.7%  47.5ms    297MiB  3.91%  4.18MiB
   compute redundan...     71    685ms  2.77%  9.64ms    203MiB  2.67%  2.85MiB
   align markers           71    211ms  0.85%  2.97ms    117MiB  1.54%  1.65MiB
 impute step 

0.00012354792879627822
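
The error rate above divides by all `n * p` entries, although only about 10% of them were masked. A rate restricted to the masked entries may be easier to interpret. The function below is a hypothetical helper, not part of MendelImpute or VCFTools, and it assumes the imputed matrix contains no `missing` values.

```julia
# Error rate over masked entries only. `masks` must have the same layout
# (here samples × SNPs) as the two genotype matrices.
function masked_error_rate(X_imputed::AbstractMatrix, X_true::AbstractMatrix,
                           masks::AbstractMatrix{Bool})
    size(X_imputed) == size(X_true) == size(masks) ||
        throw(DimensionMismatch("all three matrices must have the same size"))
    return count(X_imputed[masks] .!= X_true[masks]) / count(masks)
end

# Toy example: 2 samples × 5 SNPs, 3 masked entries, 1 of them imputed wrongly.
X_true    = Float32[0 1 2 0 1; 1 1 0 2 0]
X_imputed = Float32[0 1 2 0 1; 1 2 0 2 0]
masks_toy = Bool[0 0 1 0 0; 0 1 0 1 0]

println(masked_error_rate(X_imputed, X_true, masks_toy))   # 1 error over 3 masked entries
println(sum(X_imputed .!= X_true) / length(X_true))        # 1 error over all 10 entries
```

With the notebook's variables the call would be `masked_error_rate(X_mendel, X_complete, copy(masks'))`, because `masks` was built as SNPs × samples while `X_mendel` is samples × SNPs. If typed entries are returned unchanged by imputation (an assumption, not verified here), the masked-entry rate would be roughly ten times the overall rate.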

In [3]:
# search only single breakpoints (1 thread)
Random.seed!(2020)
tgtfile = "target_masked.vcf.gz"
reffile = "haplo_ref.jlso"
outfile = "imputed_target.vcf.gz"
width   = 500
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "target.vcf.gz")
n, p = size(X_mendel)
error_rate = sum(X_mendel .!= X_complete) / n / p

Importing genotype file...100%|█████████████████████████| Time: 0:00:06
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:10
Merging breakpoints...100%|█████████████████████████████| Time: 0:00:25


Data import time                    = 6.6587 seconds
Computing haplotype pair time       = 10.2962 seconds
Phasing by dynamic programming time = 25.9828 seconds
Imputing time                       = 3.6637 seconds
 46.601150 seconds (73.68 M allocations: 7.879 GiB, 2.17% gc time)


0.00010181909351756413

In [5]:
# search only single breakpoints (8 threads)
Random.seed!(2020)
tgtfile = "target_masked.vcf.gz"
reffile = "haplo_ref.jlso"
outfile = "imputed_target.vcf.gz"
width   = 500
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "target.vcf.gz")
n, p = size(X_mendel)
error_rate = sum(X_mendel .!= X_complete) / n / p

Importing genotype file...100%|█████████████████████████| Time: 0:00:06


Data import time                    = 7.00521 seconds
Computing haplotype pair time       = 2.14302 seconds
Phasing by dynamic programming time = 4.11389 seconds
Imputing time                       = 3.09487 seconds
 16.357013 seconds (73.69 M allocations: 7.939 GiB, 5.88% gc time)


0.00010181909351756413
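
The per-stage timings of the 1-thread and 8-thread runs above can be turned into speedups. The numbers are copied from the two outputs; the stage labels are shortened for display.

```julia
# Stage timings in seconds, copied from the 1-thread and 8-thread runs above.
stages = ["Data import", "Computing haplotype pairs",
          "Phasing (dynamic programming)", "Imputing", "Total (@time)"]
t1 = [6.6587, 10.2962, 25.9828, 3.6637, 46.601150]
t8 = [7.00521, 2.14302, 4.11389, 3.09487, 16.357013]

speedup = t1 ./ t8
for (s, x) in zip(stages, speedup)
    println(rpad(s, 32), round(x, digits = 2), "x")
end
```

On these numbers the haplotype-pair and phasing stages speed up several-fold with 8 threads, while data import and imputation take about the same time in both runs, which limits the overall speedup to under 3x.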

# Beagle 5 error rate

In [15]:
# run beagle 5 and import imputed data 
run(`java -jar beagle.28Sep18.793.jar gt=target_masked.vcf.gz ref=haplo_ref.vcf.gz out=beagle.result`)

# beagle 5 error rate
X_beagle = convert_gt(Float32, "beagle.result.vcf.gz", as_minorallele=false)
error_rate = sum(X_beagle .!= X_complete) / n / p

beagle.28Sep18.793.jar (version 5.0)
Copyright (C) 2014-2018 Brian L. Browning
Enter "java -jar beagle.28Sep18.793.jar" to list command line argument
Start time: 11:01 AM PST on 23 Jan 2020

Command line: java -Xmx3641m -jar beagle.28Sep18.793.jar
  gt=target_masked.vcf.gz
  ref=haplo_ref.vcf.gz
  out=beagle.result

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:       1,000
Study samples:           1,000

Window 1 (1:36-4999683)
Reference markers:      35,897
Study markers:          35,897

Burnin  iteration 1:           50 seconds
Burnin  iteration 2:           32 seconds
Burnin  iteration 3:           13 seconds
Burnin  iteration 4:           18 seconds
Burnin  iteration 5:           23 seconds
Burnin  iteration 6:           25 seconds

Phasing iteration 1:           13 seconds
Phasing iteration 2:           13 seconds
Phasing iteration 3:           13 seconds
Phasing iteration 4:           12 seconds
Phasing iteration 5:           12 seconds
Phasing iteration 6: 

2.231384238237179e-5

# Minimac4 error rate

Minimac4 requires the reference panel in m3vcf format, so first convert the reference VCF file with Minimac3 (run on Hoffman):

```Julia
minimac3 = "/u/home/b/biona001/haplotype_comparisons/minimac3/Minimac3/bin/Minimac3"
@time run(`$minimac3 --refHaps haplo_ref.vcf.gz --processReference --prefix haplo_ref`)
```

In [17]:
# run minimac 4
minimac4 = "/Users/biona001/Benjamin_Folder/UCLA/research/softwares/Minimac4/build/minimac4"
run(`$minimac4 --refHaps haplo_ref.m3vcf.gz --haps target_masked.vcf.gz --prefix minimac4.result`)
    
X_minimac = convert_gt(Float32, "minimac4.result.dose.vcf.gz", as_minorallele=false)
error_rate = sum(X_minimac .!= X_complete) / n / p



 -------------------------------------------------------------------------------- 
          Minimac4 - Fast Imputation Based on State Space Reduction HMM
 --------------------------------------------------------------------------------
           (c) 2014 - Sayantan Das, Christian Fuchsberger, David Hinds
                             Mary Kate Wing, Goncalo Abecasis 

 Version: 1.0.2;
 Built: Mon Sep 30 11:52:22 PDT 2019 by biona001

 Command Line Options: 
       Reference Haplotypes : --refHaps [haplo_ref.m3vcf.gz], --passOnly,
                              --rsid, --referenceEstimates [ON],
                              --mapFile [docs/geneticMapFile.b38.map.txt.gz]
          Target Haplotypes : --haps [target_masked.vcf.gz]
          Output Parameters : --prefix [minimac4.result], --estimate,
                              --nobgzip, --vcfBuffer [200], --format [GT,DS],
                              --allTypedSites, --meta, --memUsage
        Chunking Parameters : --ChunkLengthMb

0.00018399866284090594
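
The error rates reported in this notebook can be collected and ranked in one place. The values below are copied from the outputs above (for MendelImpute, the value printed by the last two runs); all three are computed over every entry, not only the masked ones.

```julia
# Error rates copied from the outputs above.
errors = [
    "MendelImpute" => 0.00010181909351756413,
    "Beagle 5"     => 2.231384238237179e-5,
    "Minimac4"     => 0.00018399866284090594,
]

best = minimum(last, errors)
for (name, err) in sort(errors, by = last)
    println(rpad(name, 14), err, "  (", round(err / best, digits = 1), "x the lowest)")
end
```

On this simulated data set Beagle 5 has the lowest error rate, followed by MendelImpute and then Minimac4.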