# Compare MendelImpute against Minimac4 and Beagle5 on simulated data

In compare 3, we increase the recombination and mutation rates by 3x (from 2e-8 to 6e-8). This should increase the number of unique haplotypes per window.

In [1]:
using Revise
using VCFTools
using MendelImpute
using GeneticVariation
using Random
using SparseArrays
using JLD2, FileIO, JLSO
using ProgressMeter
using GroupSlices
using ThreadPools
# using Plots
# using ProfileView

┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1273
│ - If you have MendelImpute checked out for development and have
│   added Lasso as a dependency but haven't updated your primary
│   environment's manifest file, try `Pkg.resolve()`.
│ - Otherwise you may need to report an issue with MendelImpute


# Simulate data

### Step 0. Install `msprime`

[msprime download Link](https://msprime.readthedocs.io/en/stable/installation.html).

Some users may need to activate the conda base environment first via `conda config --set auto_activate_base True`. You can turn this off once the simulation is done by running `conda config --set auto_activate_base False`.


### Step 1. Simulate data in terminal

```
python3 msprime_script.py 40000 10000 10000000 6e-8 6e-8 2020 > full.vcf
```

Arguments: 
+ Number of haplotypes = 40000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 10 million (same as Beagle 5's choice)
+ Recombination rate = 6e-8 (3x the default of 2e-8)
+ Mutation rate = 6e-8 (3x the default of 2e-8)
+ Seed = 2020
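
As a minimal sketch, the same command can be assembled from Julia so that each argument is named and the 3x scaling is explicit. The variable names are illustrative and not part of `msprime_script.py`; the values are copied from the command above, and the `run` line is left commented out because it assumes the script is in the working directory.

```julia
# Values copied from the terminal command above; variable names are illustrative.
n_haplotypes = 40_000        # 20,000 diploid samples
Ne           = 10_000        # effective population size
seq_len      = 10_000_000    # sequence length in base pairs
default_rate = 2e-8          # default rate before scaling
rate         = 6e-8          # 3x the default, used for recombination and mutation
seed         = 2020

# Interpolation turns each value into its own command-line argument.
cmd = `python3 msprime_script.py $n_haplotypes $Ne $seq_len $rate $rate $seed`

# Uncomment to run the simulation and write the VCF to disk
# (assumes msprime_script.py is in the working directory).
# run(pipeline(cmd, stdout = "full.vcf"))

println(cmd)
println("rate / default = ", rate / default_rate)
```

Note that 40,000 haplotypes correspond to 20,000 diploid samples, which Step 2 splits into 1,000 target samples and 19,000 reference samples.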

### Step 2. Convert simulated haplotypes to reference haplotypes and target genotype files

In [9]:
cd("./compare3/")
function filter_and_mask(maf)
    # filter chromosome data for unique snps
    println("filtering for unique snps")
    data = "full.vcf"
    full_record_index = .!find_duplicate_marker(data)
    @time VCFTools.filter(data, full_record_index, 1:nsamples(data), 
        des = "full.uniqueSNPs.vcf.gz")

    # summarize data
    println("summarizing data")
    total_snps, samples, _, _, _, maf_by_record, _ = gtstats("full.uniqueSNPs.vcf.gz")

    # generate target panel with all snps
    println("generating complete target panel")
    n = 1000
    sample_idx = falses(samples)
    sample_idx[1:n] .= true
    shuffle!(sample_idx)
    @time VCFTools.filter("full.uniqueSNPs.vcf.gz", 1:total_snps, 
        sample_idx, des = "target.full.vcf.gz", allow_multiallelic=false)

    # also generate reference panel without target samples
    println("generating reference panel without target samples")
    @time VCFTools.filter("full.uniqueSNPs.vcf.gz", 1:total_snps, 
        .!sample_idx, des = "ref.excludeTarget.vcf.gz", allow_multiallelic=false)

    # generate target file with 1000 samples and typed snps with certain maf
    println("generating target file with typed snps only")
    my_maf = findall(x -> x > maf, maf_by_record)  
    p = length(my_maf)
    record_idx = falses(total_snps)
    record_idx[my_maf] .= true
    @time VCFTools.filter("full.uniqueSNPs.vcf.gz", record_idx, sample_idx, 
        des = "target.typedOnly.maf$maf.vcf.gz", allow_multiallelic=false)

    # unphase and mask 10% of entries in target file (missingprop = 0.1)
    println("unphasing and masking entries in target file with typed snps only")
    masks = falses(p, n)
    missingprop = 0.1
    for j in 1:n, i in 1:p
        rand() < missingprop && (masks[i, j] = true)
    end
    @time mask_gt("target.typedOnly.maf$maf.vcf.gz", masks, 
        des="target.typedOnly.maf$maf.masked.vcf.gz", unphase=true)

    # finally compress reference file to jlso format
    widths  = [32, 64, 128, 256, 512]
    reffile = "ref.excludeTarget.vcf.gz"
    tgtfile = "target.typedOnly.maf$maf.masked.vcf.gz"
    H, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt = convert_ht(Bool, reffile, trans=true, save_snp_info=true, msg="importing reference data...")
    X, X_sampleID, X_chr, X_pos, X_ids, X_ref, X_alt = VCFTools.convert_gt(UInt8, tgtfile, trans=true, save_snp_info=true, msg = "Importing genotype file...")
    for width in widths
        outfile = "ref.excludeTarget.w$width.jlso"
        @time compress_haplotypes(H, X, outfile, X_pos, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt, width)
    end
end
Random.seed!(2020)
maf = 0.1
@time filter_and_mask(maf)

filtering for unique snps
755.465285 seconds (10.75 G allocations: 800.675 GiB, 6.46% gc time)
summarizing data


Progress: 100%|█████████████████████████████████████████| Time: 0:26:35


generating complete target panel
612.728196 seconds (12.09 G allocations: 949.884 GiB, 10.88% gc time)
generating reference panel without target samples
1765.728035 seconds (36.27 G allocations: 2.454 TiB, 13.02% gc time)
generating target file with typed snps only
538.035998 seconds (11.01 G allocations: 830.194 GiB, 12.75% gc time)
unphasing and masking entries in target file with typed snps only
 17.902879 seconds (106.60 M allocations: 7.939 GiB, 4.67% gc time)


importing reference data...100%|████████████████████████| Time: 0:19:49
Importing genotype file...100%|█████████████████████████| Time: 0:00:14


441.226699 seconds (14.22 M allocations: 38.447 GiB, 3.34% gc time)
362.697065 seconds (5.41 M allocations: 20.743 GiB, 2.36% gc time)
242.157296 seconds (4.88 M allocations: 11.718 GiB, 0.77% gc time)
219.485904 seconds (4.28 M allocations: 7.015 GiB, 0.67% gc time)
221.963330 seconds (3.65 M allocations: 4.597 GiB, 0.52% gc time)
9622.542706 seconds (112.85 G allocations: 9.120 TiB, 10.33% gc time)
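
The masking step in `filter_and_mask` marks each typed genotype as missing independently with probability `missingprop`. A minimal, self-contained sketch of that idea on a toy matrix is shown here; the sizes and the local random number generator are assumptions made for illustration, and the vectorized comparison is an alternative to the explicit double loop, not the code that produced the files above.

```julia
using Random

rng = MersenneTwister(2020)   # local RNG so the sketch does not touch the global seed
p, n = 1_000, 100             # toy sizes: p typed SNPs by n samples
missingprop = 0.1             # same proportion as in filter_and_mask

# Each entry is masked independently with probability `missingprop`.
masks = rand(rng, p, n) .< missingprop   # BitMatrix, true = entry will be masked

println("fraction masked = ", count(masks) / length(masks))
```

The resulting `masks` has the same shape (SNPs by samples) as the matrix passed to `mask_gt` above, and the observed fraction should land close to 0.1 for matrices of this size.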


In [19]:
# load jlso
@time loaded = JLSO.load("ref.excludeTarget.w512.jlso")
compressed_Hunique = loaded[:compressed_Hunique];

  0.755665 seconds (1.72 M allocations: 203.743 MiB)


In [13]:
tgtfile = "./compare3/target.typedOnly.maf0.1.masked.vcf.gz"
reffile = "./compare3/ref.excludeTarget.vcf.gz"
@show nrecords(tgtfile), nsamples(tgtfile)
@show nrecords(reffile), nsamples(reffile);

(nrecords(tgtfile), nsamples(tgtfile)) = (53267, 1000)
(nrecords(reffile), nsamples(reffile)) = (268629, 19000)
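
These counts give a quick sanity check on the setup. The sketch below computes the fraction of reference SNPs that are typed in the target file and the number of windows at width 512. The window formula is an assumption (integer division of typed SNPs by width); it agrees with the 104 windows that `phase` reports in its log, but the package's actual windowing rule is not shown in this notebook.

```julia
typed_snps = 53_267     # nrecords(tgtfile) reported above
ref_snps   = 268_629    # nrecords(reffile) reported above
width      = 512        # largest width compressed in Step 2

typed_fraction = typed_snps / ref_snps

# Assumption: windows are counted as floor(typed SNPs / width).
windows = typed_snps ÷ width

println("fraction of SNPs typed = ", round(typed_fraction, digits = 3))
println("windows at width $width = ", windows)
```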


# MendelImpute with dynamic programming

In [5]:
# keep best pair only (1 thread)
Random.seed!(2020)
width   = 512
tgtfile = "./compare3/target.typedOnly.maf0.1.masked.vcf.gz"
reffile = "./compare3/ref.excludeTarget.w$width.jlso"
outfile = "./compare3/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare3/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:10
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:11
Merging breakpoints...100%|█████████████████████████████| Time: 0:00:15
Writing to file...100%|█████████████████████████████████| Time: 0:00:18


Total windows = 104, averaging ~ 1121 unique haplotypes per window.

Timings: 
    Data import                     = 13.2537 seconds
    Computing haplotype pair        = 11.8286 seconds
        BLAS3 mul! to get M and N      = 0.196211 seconds per thread
        haplopair search               = 10.0515 seconds per thread
        supplying constant terms       = 0.00836884 seconds per thread
        finding redundant happairs     = 0.214265 seconds per thread
    Phasing by dynamic programming  = 15.8494 seconds
    Imputation                      = 23.2537 seconds

 64.185470 seconds (114.20 M allocations: 12.498 GiB, 4.24% gc time)
error_rate = 0.00022427958262138486
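
The error rate used throughout this notebook is the fraction of genotype entries that differ between the imputed and true matrices. A toy example, with made-up matrices, makes the formula concrete:

```julia
# Toy genotype matrices (samples x SNPs); entries are allele counts 0, 1, or 2.
# The values are made up for illustration.
X_true    = Float32[0 1 2 0;
                    1 1 0 2;
                    2 0 0 1]
X_imputed = Float32[0 1 2 0;
                    1 2 0 2;
                    2 0 0 1]

n, p = size(X_true)

# Same formula as in the cell above: mismatched entries divided by all entries.
error_rate = sum(X_imputed .!= X_true) / n / p
println("error_rate = ", error_rate)
```

Here one of the twelve entries differs, so the error rate is 1/12. Because the denominator is `n * p`, every entry in the matrix counts, not only the entries that were masked or untyped.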


# MendelImpute with intersecting haplotype sets

In [3]:
# keep best pair only (1 thread)
Random.seed!(2020)
width   = 512
tgtfile = "./compare3/target.typedOnly.maf0.1.masked.vcf.gz"
reffile = "./compare3/ref.excludeTarget.w$width.jlso"
outfile = "./compare3/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare3/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:11
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:12
Writing to file...100%|█████████████████████████████████| Time: 0:00:19


Total windows = 104, averaging ~ 1121 unique haplotypes per window.

Timings: 
    Data import                     = 14.1737 seconds
    Computing haplotype pair        = 12.6018 seconds
        BLAS3 mul! to get M and N      = 0.247583 seconds per thread
        haplopair search               = 10.0533 seconds per thread
        supplying constant terms       = 0.00820274 seconds per thread
        finding redundant happairs     = 0.037361 seconds per thread
    Phasing by dynamic programming  = 0.713093 seconds
    Imputation                      = 23.6164 seconds

 51.105105 seconds (113.78 M allocations: 12.547 GiB, 6.59% gc time)
error_rate = 5.653894404550514e-5


# Try Lasso

In [2]:
# keep best pair only (1 thread)
Random.seed!(2020)
width   = 512
tgtfile = "./compare3/target.typedOnly.maf0.1.masked.vcf.gz"
reffile = "./compare3/ref.excludeTarget.w$width.jlso"
outfile = "./compare3/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, lasso=true);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare3/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:11
Computing optimal haplotype pairs...100%|███████████████| Time: 0:08:55
Writing to file...100%|█████████████████████████████████| Time: 0:00:20


Total windows = 104, averaging ~ 1121 unique haplotypes per window.

Timings: 
    Data import                     = 16.9048 seconds
    Computing haplotype pair        = 535.978 seconds
        BLAS3 mul! to get M and N      = 0.0 seconds per thread
        haplopair search               = 0.0 seconds per thread
        finding redundant happairs     = 0.394146 seconds per thread
    Phasing by win-win intersection = 2.07432 seconds
    Imputation                      = 24.3816 seconds

603.856542 seconds (257.25 M allocations: 1.838 TiB, 30.53% gc time)
error_rate = 0.001078141228236713


# Beagle 5.1 Error

In [None]:
# convert to bref3 (run in terminal)
java -jar ../bref3.18May20.d20.jar ref.excludeTarget.vcf.gz > ref.excludeTarget.bref3 

In [3]:
# run beagle 5 (8 threads)
run(`java -jar beagle.18May20.d20.jar gt=./compare3/target.typedOnly.maf0.1.masked.vcf.gz ref=compare3/ref.excludeTarget.bref3 out=compare3/beagle.result nthreads=8`)

# beagle 5 error rate
X_complete = convert_gt(Float32, "compare3/target.full.vcf.gz")
n, p = size(X_complete)
X_beagle = convert_gt(Float32, "compare3/beagle.result.vcf.gz")
error_rate = sum(X_beagle .!= X_complete) / n / p

beagle.18May20.d20.jar (version 5.1)
Copyright (C) 2014-2018 Brian L. Browning
Enter "java -jar beagle.18May20.d20.jar" to list command line argument
Start time: 06:29 PM PDT on 30 Jun 2020

Command line: java -Xmx3641m -jar beagle.18May20.d20.jar
  gt=./compare3/target.typedOnly.maf0.1.masked.vcf.gz
  ref=compare3/ref.excludeTarget.bref3
  out=compare3/beagle.result
  nthreads=8

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:      19,000
Study samples:           1,000

Window 1 (1:7-9999996)
Reference markers:     268,629
Study markers:          53,267

Burnin  iteration 1:           32 seconds
Burnin  iteration 2:           49 seconds
Burnin  iteration 3:           50 seconds
Burnin  iteration 4:           50 seconds
Burnin  iteration 5:           49 seconds
Burnin  iteration 6:           1 minute 27 seconds

Phasing iteration 1:           1 minute 45 seconds
Phasing iteration 2:           52 seconds
Phasing iteration 3:           51 seconds
Phasing iteration 4:  

1.9781929724638815e-5
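
To put the accuracy results side by side, the sketch below collects the error rates printed in the cells above and expresses each one relative to Beagle 5.1. The numbers are copied from the outputs; the labels and layout are illustrative. Timings are left out because the MendelImpute runs used 1 thread while Beagle used 8, so they are not directly comparable.

```julia
# Error rates copied from the cell outputs above; labels are illustrative.
error_rates = [
    "MendelImpute (dynamic programming)"    => 0.00022427958262138486,
    "MendelImpute (haplotype intersection)" => 5.653894404550514e-5,
    "MendelImpute (lasso)"                  => 0.001078141228236713,
    "Beagle 5.1"                            => 1.9781929724638815e-5,
]

baseline = last(error_rates[end])   # Beagle 5.1 error rate

# Print methods from most to least accurate, with error relative to Beagle.
for (method, err) in sort(error_rates, by = last)
    println(rpad(method, 40), " error = ", round(err, sigdigits = 3),
            "  (", round(err / baseline, digits = 1), "x Beagle)")
end
```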

# Minimac4 error

The reference VCF file must first be converted to m3vcf format using Minimac3 (run on Hoffman):

```Julia
minimac3 = "/u/home/b/biona001/haplotype_comparisons/minimac3/Minimac3/bin/Minimac3"
@time run(`$minimac3 --refHaps haplo_ref.vcf.gz --processReference --prefix haplo_ref`)
```

In [None]:
# use eagle 2.4 for prephasing

In [None]:
# run minimac 4
minimac4 = "/Users/biona001/Benjamin_Folder/UCLA/research/softwares/Minimac4/build/minimac4"
run(`$minimac4 --refHaps haplo_ref.m3vcf.gz --haps target_masked.vcf.gz --prefix minimac4.result`)
    
X_minimac = convert_gt(Float32, "minimac4.result.dose.vcf.gz", as_minorallele=false)
error_rate = sum(X_minimac .!= X_complete) / n / p