# Compare MendelImpute against Minimac4 and Beagle5 on simulated data

In this second comparison (`compare2`), we distinguish typed from untyped SNPs: only about 50% of all SNPs are typed, and 10% of the typed genotypes are masked as missing.

In [1]:
using Revise
using VCFTools
using MendelImpute
using GeneticVariation
using Random
using SparseArrays
using JLD2, FileIO, JLSO
using ProgressMeter
using GroupSlices
using ThreadPools
using LinearAlgebra
using CSV
using UnicodePlots
# using Plots
# using ProfileView

┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1278


In [13]:
# impute with double breakpoint search (8 threads)
Random.seed!(2020)
d       = 1000
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.maxd$d.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time ph = phase(tgtfile, reffile, outfile);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile, trans=true)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz", trans=true)
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
# rm(outfile, force=true)

Number of threads = 8
Importing reference haplotype data...


Importing genotype file...100%|█████████████████████████| Time: 0:00:07


Float32[NaN, 0.740699, 0.0, 0.0, NaN, 0.51937985, 0.0, 0.0, NaN, NaN, -6.231189f-37, 4.5757f-41, 0.48044693, 0.59598213, 0.8111732, 0.80680573, NaN, 0.95583236, 4.420361f35, 4.5757f-41]

Float32[0.740699, 0.740699, 0.63003945, 0.63003945, 0.63003945, 0.51937985, 0.4999134, 0.4999134, 0.4999134, 0.4999134, 0.4999134, 0.4999134, 0.48044693, 0.59598213, 0.8111732, 0.80680573, 0.88131905, 0.95583236, 0.9430847, 0.9430847]

Total windows = 64, averaging ~ 692 unique haplotypes per window.

Timings: 
    Data import                     = 8.85105 seconds
        import target data             = 8.24312 seconds
        import compressed haplotypes   = 0.607925 seconds
    Computing haplotype pair        = 2.22866 seconds
        BLAS3 mul! to get M and N      = 0.0692214 seconds per thread
        haplopair search               = 1.76287 seconds per thread
        initializing missing           = 0.163842 seconds per thread
        allocating and viewing         = 0.0252175 seconds per thread


In [28]:
using GeneticVariation
reader = VCF.Reader(openvcf(outfile, "r"))
snpscores = zeros(nrecords(outfile))

# loop over SNPs; the value of the first INFO field is read as the per-SNP score
for (i, record) in enumerate(reader)
    snpscores[i] = parse(Float64, VCF.info(record)[1].second)
end
close(reader)
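The loop above takes the value of the first INFO entry of each record. A dependency-free sketch of the same parsing step on raw INFO strings; the key names `IMPQ` and `AF` and the numbers are made up for illustration and are not taken from the MendelImpute output.

```julia
# Hypothetical INFO column entries; keys and values are assumptions for illustration.
info_strings = ["IMPQ=0.74;AF=0.12", "IMPQ=0.52;AF=0.40", "IMPQ=0.96;AF=0.03"]

# Return the value of the first `key=value` entry of an INFO string as Float64.
function first_info_value(info::AbstractString)
    first_field = first(split(info, ';'))
    _, value = split(first_field, '='; limit = 2)
    return parse(Float64, value)
end

toy_scores = first_info_value.(info_strings)
println(toy_scores)
```

Relying on the position of the field is fragile if the INFO column ever gains another entry in front; looking the value up by key would be more robust.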


In [29]:
histogram(snpscores)

                ┌                                        ┐
   [0.4 , 0.45) ┤ 1
   [0.45, 0.5 ) ┤▇▇▇▇ 1521
   [0.5 , 0.55) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5062
   [0.55, 0.6 ) ┤▇▇▇▇▇▇▇▇▇▇▇▇ 4189
   [0.6 , 0.65) ┤▇▇▇▇▇▇▇▇▇▇▇ 3991
   [0.65, 0.7 ) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4903
   [0.7 , 0.75) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7854
   [0.75, 0.8 ) ┤▇▇

In [36]:
# impute with double breakpoint search (8 threads)
Random.seed!(2020)
d       = 1000
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.maxd$d.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time ph = phase(tgtfile, reffile, outfile);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile, trans=true)
# X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz", trans=true)
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
# rm(outfile, force=true)

Number of threads = 8
Importing reference haplotype data...


Importing genotype file...100%|█████████████████████████| Time: 0:00:10


Total windows = 64, averaging ~ 692 unique haplotypes per window.

Timings: 
    Data import                     = 11.154 seconds
        import target data             = 10.3364 seconds
        import compressed haplotypes   = 0.817555 seconds
    Computing haplotype pair        = 2.46339 seconds
        BLAS3 mul! to get M and N      = 0.0690621 seconds per thread
        haplopair search               = 1.98729 seconds per thread
        initializing missing           = 0.170655 seconds per thread
        allocating and viewing         = 0.0228817 seconds per thread
        index conversion               = 0.0344565 seconds per thread
    Phasing by win-win intersection = 0.325044 seconds
        Window-by-window intersection  = 0.0279545 seconds per thread
        Breakpoint search              = 0.221404 seconds per thread
        Recording result               = 0.00508302 seconds per thread
    Imputation                     = 1.26599 seconds
        Imputing missing            

# Simulate data

### Step 0. Install `msprime`

[msprime installation instructions](https://msprime.readthedocs.io/en/stable/installation.html).

Some users may first need to activate the conda base environment via `conda config --set auto_activate_base True`. You can turn it off again once the simulation is done by running `conda config --set auto_activate_base False`.


### Step 1. Simulate data in terminal

```
python3 msprime_script.py 40000 10000 10000000 2e-8 2e-8 2020 > full.vcf
```

Arguments: 
+ Number of haplotypes = 40000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 10 million (same as Beagle 5's choice)
+ Recombination rate = 2e-8 (default)
+ Mutation rate = 2e-8 (default)
+ Seed = 2020
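The positional arguments are easy to mix up, so here is a sketch that builds the same command from named values inside Julia. The field names are labels chosen here for readability (they are not options of `msprime_script.py`), and the seed is the one that appears in the command itself.

```julia
# Named simulation parameters, in the positional order used by the command above.
params = (
    n_haplotypes    = 40_000,
    effective_size  = 10_000,
    sequence_length = 10_000_000,
    recombination   = 2e-8,
    mutation        = 2e-8,
    seed            = 2020,
)

# Interpolating a vector into a Cmd expands it into separate arguments.
cmd = `python3 msprime_script.py $(collect(string.(values(params))))`
println(cmd)

# To actually run it and capture the VCF (needs python3, msprime and the script):
# run(pipeline(cmd, stdout = "full.vcf"))
```

Note that Julia prints `2e-8` as `2.0e-8`, which Python parses as the same float.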

### Step 2: Convert simulated haplotypes to reference haplotypes and target genotype files

In [11]:
cd("./compare2/")
function filter_and_mask(maf)
    # filter chromosome data for unique snps
#     println("filtering for unique snps")
#     data = "full.vcf"
#     full_record_index = .!find_duplicate_marker(data)
#     @time VCFTools.filter(data, full_record_index, 1:nsamples(data), 
#         des = "full.uniqueSNPs.vcf.gz")

    # summarize data
    println("summarizing data")
    total_snps, samples, _, _, _, maf_by_record, _ = gtstats("full.uniqueSNPs.vcf.gz")

    # keep snps with at least 5 copies of minor alleles
    snps_tokeep = findall(x -> x ≥ 5 / 2samples, maf_by_record)
    
    # generate target panel with all snps
    println("generating complete target panel")
    n = 1000
    sample_idx = falses(samples)
    sample_idx[1:n] .= true
    shuffle!(sample_idx)
    @time VCFTools.filter("full.uniqueSNPs.vcf.gz", snps_tokeep, 
        sample_idx, des = "target.full.vcf.gz", allow_multiallelic=false)

    # also generate reference panel without target samples
    println("generating reference panel without target samples")
    @time VCFTools.filter("full.uniqueSNPs.vcf.gz", snps_tokeep, 
        .!sample_idx, des = "ref.excludeTarget.vcf.gz", allow_multiallelic=false)

    # generate target file with 1000 samples and typed snps with certain maf
    println("generating target file with typed snps only")
    my_maf = findall(x -> x > maf, maf_by_record)  
    p = length(my_maf)
    record_idx = falses(total_snps)
    record_idx[my_maf] .= true
    @time VCFTools.filter("full.uniqueSNPs.vcf.gz", record_idx, sample_idx, 
        des = "target.typedOnly.maf$maf.vcf.gz", allow_multiallelic=false)

    # unphase and mask 10% of entries in target file (missingprop = 0.1)
    println("unphasing and masking entries in target file with typed snps only")
    masks = falses(p, n)
    missingprop = 0.1
    for j in 1:n, i in 1:p
        rand() < missingprop && (masks[i, j] = true)
    end
    @time mask_gt("target.typedOnly.maf$maf.vcf.gz", masks, 
        des="target.typedOnly.maf$maf.masked.vcf.gz", unphase=true)

    # finally compress reference file to jlso format
    d = 1000
    reffile = "ref.excludeTarget.vcf.gz"
    tgtfile = "target.typedOnly.maf$maf.masked.vcf.gz"
    outfile = "ref.excludeTarget.maxd$d.jlso"
    H, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt = convert_ht(Bool, reffile, trans=true, save_snp_info=true, msg="importing reference data...")
    X, X_sampleID, X_chr, X_pos, X_ids, X_ref, X_alt = VCFTools.convert_gt(UInt8, tgtfile, trans=true, save_snp_info=true, msg = "Importing genotype file...")
    @time compress_haplotypes(H, X, outfile, X_pos, H_sampleID, H_chr, H_pos, H_ids, 
        H_ref, H_alt, d, 0, 0.0)
end
Random.seed!(2020)
maf = 0.01
@time filter_and_mask(maf)

summarizing data


Progress: 100%|█████████████████████████████████████████| Time: 0:10:04


generating complete target panel


filtering vcf file...100%|██████████████████████████████| Time: 0:10:15


626.713593 seconds (5.62 G allocations: 597.236 GiB, 11.93% gc time)
generating reference panel without target samples


filtering vcf file...100%|██████████████████████████████| Time: 0:18:46


1136.637153 seconds (9.58 G allocations: 940.144 GiB, 14.83% gc time)
generating target file with typed snps only


filtering vcf file...100%|██████████████████████████████| Time: 0:12:32


765.817117 seconds (5.51 G allocations: 581.825 GiB, 13.24% gc time)
unphasing and masking entries in target file with typed snps only


masking vcf file...100%|████████████████████████████████| Time: 0:00:07


  8.084801 seconds (73.79 M allocations: 5.496 GiB, 4.69% gc time)


LoadError: UndefVarError: d not defined
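Two steps of `filter_and_mask` are worth isolating: the minor-allele-count filter and the random mask. The sketch below reproduces both on toy data using only the standard library; the `maf_by_record` values and the matrix dimensions are made up, and the vectorized mask has the same distribution as the double loop but does not necessarily consume the random stream in the same way.

```julia
using Random, Statistics

Random.seed!(2020)

# With `samples` diploid individuals there are 2 * samples haplotypes, so
# "at least 5 copies of the minor allele" means MAF ≥ 5 / (2 * samples).
samples   = 20_000                       # 19,000 reference + 1,000 target
threshold = 5 / 2samples                 # parsed as 5 / (2 * samples)
maf_by_record = [0.0001, 0.0002, 0.003, 0.02, 0.3]   # made-up values
snps_tokeep = findall(x -> x ≥ threshold, maf_by_record)

# Mask each entry independently with probability `missingprop`.
p, n = 500, 100                          # toy sizes: SNPs x samples
missingprop = 0.1
masks = rand(p, n) .< missingprop

println("threshold       = ", threshold)
println("kept SNPs       = ", snps_tokeep)
println("fraction masked = ", mean(masks))
```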

In [2]:
# compress reference file to jlso format
max_d    = [1000] # max haplotypes per window
reffile  = "./compare2/ref.excludeTarget.vcf.gz"
tgtfile  = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
H, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt = convert_ht(Bool, reffile, trans=true, save_snp_info=true, msg="importing reference data...")
X, X_sampleID, X_chr, X_pos, X_ids, X_ref, X_alt = VCFTools.convert_gt(UInt8, tgtfile, trans=true, save_snp_info=true, msg = "Importing genotype file...")
for d in max_d
    outfile = "./compare2/ref.excludeTarget.maxd$d.jlso"
    @time compress_haplotypes(H, X, outfile, X_pos, H_sampleID, H_chr, H_pos, H_ids, 
        H_ref, H_alt, d, 0, 0.0)
end

importing reference data...100%|████████████████████████| Time: 0:03:14
Importing genotype file...100%|█████████████████████████| Time: 0:00:07


151.907803 seconds (26.11 M allocations: 2.695 GiB, 0.27% gc time)


In [34]:
# load jlso
@time loaded = JLSO.load("./compare2/ref.excludeTarget.maxd1000.jlso")
compressed_Hunique = loaded[:compressed_Hunique];

  0.830604 seconds (1.76 M allocations: 154.475 MiB)


In [32]:
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.vcf.gz"
@show nrecords(tgtfile), nsamples(tgtfile)
@show nrecords(reffile), nsamples(reffile);

(nrecords(tgtfile), nsamples(tgtfile)) = (36874, 1000)
(nrecords(reffile), nsamples(reffile)) = (89913, 19000)
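A quick sanity check on these dimensions, using only the counts printed above. On these numbers the typed fraction comes out near 41% rather than exactly one half.

```julia
# Counts copied from the output above.
typed_snps, target_samples = 36_874, 1_000
total_snps, ref_samples    = 89_913, 19_000

frac_typed   = typed_snps / total_snps
untyped_snps = total_snps - typed_snps
ref_haps     = 2 * ref_samples           # diploid samples, two haplotypes each

println("fraction of SNPs typed = ", round(frac_typed, digits = 3))
println("untyped SNPs to impute = ", untyped_snps)
println("reference haplotypes   = ", ref_haps)
```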


# MendelImpute with dynamic programming

In [22]:
# keep best pair only (1 thread)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:06
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:11
Merging breakpoints...100%|█████████████████████████████| Time: 0:01:12
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.07908 seconds
    Computing haplotype pair        = 11.4215 seconds
        BLAS3 mul! to get M and N      = 0.209483 seconds per thread
        haplopair search               = 9.27352 seconds per thread
        supplying constant terms       = 0.0368894 seconds per thread
        finding redundant happairs     = 0.800724 seconds per thread
    Phasing by dynamic programming  = 72.6589 seconds
    Imputation                      = 7.44617 seconds

 99.605137 seconds (77.18 M allocations: 7.509 GiB, 1.09% gc time)
error_rate = 0.0003560219323123464


In [3]:
# keep best pair only (8 threads)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:06
Merging breakpoints...100%|█████████████████████████████| Time: 0:00:11
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.06139 seconds
    Computing haplotype pair        = 3.04445 seconds
        BLAS3 mul! to get M and N      = 0.0601239 seconds per thread
        haplopair search               = 2.45815 seconds per thread
        supplying constant terms       = 0.00534291 seconds per thread
        finding redundant happairs     = 0.155477 seconds per thread
    Phasing by dynamic programming  = 11.9017 seconds
    Imputation                      = 7.78007 seconds

 30.787638 seconds (77.18 M allocations: 7.555 GiB, 3.16% gc time)
error_rate = 0.0003560219323123464
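The two dynamic-programming runs differ only in thread count, so their timings give a rough multithreading speedup. The arithmetic below uses the numbers reported above; data import and imputation take about the same time in both runs, which is why the overall speedup is well below the speedup of the phasing step alone.

```julia
# Timings (seconds) copied from the 1-thread and 8-thread runs above.
total_1, total_8 = 99.605137, 30.787638      # wall time reported by @time
dp_1, dp_8       = 72.6589, 11.9017          # "Phasing by dynamic programming"

speedup(t1, t8) = t1 / t8

total_speedup = speedup(total_1, total_8)
dp_speedup    = speedup(dp_1, dp_8)

println("overall speedup on 8 threads    = ", round(total_speedup, digits = 2))
println("DP phasing speedup on 8 threads = ", round(dp_speedup, digits = 2))
```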


# MendelImpute with intersecting haplotype sets

In [5]:
# impute only (8 thread)
Random.seed!(2020)
d       = 1000
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.maxd$d.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time ph = phase(tgtfile, reffile, outfile = outfile, max_d = d,
    dynamic_programming = false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing reference haplotype data...


Importing genotype file...100%|█████████████████████████| Time: 0:00:09
Writing to file...100%|█████████████████████████████████| Time: 0:00:08


Total windows = 64, averaging ~ 692 unique haplotypes per window.

Timings: 
    Data import                     = 11.2971 seconds
    Computing haplotype pair        = 3.34432 seconds
        BLAS3 mul! to get M and N      = 0.102085 seconds per thread
        haplopair search               = 2.82811 seconds per thread
        initializing missing           = 0.211092 seconds per thread
        allocating and viewing         = 0.0287914 seconds per thread
        index conversion               = 0.000847365 seconds per thread
    Phasing by win-win intersection = 0.0654008 seconds
        Window-by-window intersection  = 0.0452051 seconds per thread
        Breakpoint search              = 0.00805598 seconds per thread
        Recording result               = 0.00531286 seconds per thread
    Imputation                      = 9.21823 seconds

 23.926207 seconds (77.16 M allocations: 6.401 GiB, 4.77% gc time)
error_rate = 7.199181430938797e-5
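For reference, the error rate printed by these cells is the fraction of genotype entries (over all samples and all SNPs, typed and untyped) that differ from the truth. A toy example with made-up 4 x 3 genotype matrices:

```julia
# Toy "true" genotypes (entries are minor-allele counts 0, 1 or 2); values are made up.
X_true = Float32[0 1 2;
                 1 1 0;
                 2 0 0;
                 0 0 1]

# Pretend imputation got two entries wrong.
X_imputed = copy(X_true)
X_imputed[1, 2] = 0
X_imputed[4, 3] = 2

n_toy, p_toy = size(X_imputed)
toy_error_rate = sum(X_imputed .!= X_true) / n_toy / p_toy
println("error_rate = ", toy_error_rate)    # 2 mismatches out of 12 entries
```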


In [7]:
# phase (8 thread)
Random.seed!(2020)
d       = 1000
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.maxd$d.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time ph = phase(tgtfile, reffile, outfile = outfile, max_d = d,
    dynamic_programming = false, phase=true);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing reference haplotype data...


Importing genotype file...100%|█████████████████████████| Time: 0:00:08
Writing to file...100%|█████████████████████████████████| Time: 0:00:08


Total windows = 64, averaging ~ 692 unique haplotypes per window.

Timings: 
    Data import                     = 10.3269 seconds
    Computing haplotype pair        = 2.79167 seconds
        BLAS3 mul! to get M and N      = 0.0938853 seconds per thread
        haplopair search               = 2.33387 seconds per thread
        initializing missing           = 0.1796 seconds per thread
        allocating and viewing         = 0.0255029 seconds per thread
        index conversion               = 0.0006521 seconds per thread
    Phasing by win-win intersection = 0.0448282 seconds
        Window-by-window intersection  = 0.0314254 seconds per thread
        Breakpoint search              = 0.00556094 seconds per thread
        Recording result               = 0.0035269 seconds per thread
    Imputation                      = 9.05471 seconds

 22.651812 seconds (78.25 M allocations: 6.328 GiB, 4.97% gc time)
error_rate = 9.775004726791454e-5
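The timing labels ("Window-by-window intersection", "Breakpoint search") suggest that phasing chains candidate haplotypes across adjacent windows. The sketch below is only one plausible reading of that idea, not MendelImpute's actual implementation: `chain_windows` is a hypothetical helper that keeps intersecting the candidate sets of consecutive windows and starts a new segment whenever the intersection becomes empty.

```julia
# Hypothetical helper: greedily extend a segment while consecutive windows
# still share at least one candidate haplotype.
function chain_windows(candidates::Vector{Set{Int}})
    segments = Tuple{UnitRange{Int}, Set{Int}}[]
    start    = 1
    current  = copy(candidates[1])
    for w in 2:length(candidates)
        shared = intersect(current, candidates[w])
        if isempty(shared)
            # no haplotype survives: close the segment, a breakpoint lies near window w
            push!(segments, (start:w-1, current))
            start, current = w, copy(candidates[w])
        else
            current = shared
        end
    end
    push!(segments, (start:length(candidates), current))
    return segments
end

# Made-up candidate haplotype indices for 5 windows of one strand.
cands = [Set([1, 2, 3]), Set([2, 3]), Set([3, 7]), Set([8, 9]), Set([9])]
segs  = chain_windows(cands)
for (windows, haps) in segs
    println("windows ", windows, " share haplotype(s) ", sort(collect(haps)))
end
```

In this toy input, windows 1 to 3 share haplotype 3 and windows 4 to 5 share haplotype 9, so a single switch is placed between windows 3 and 4.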


# Rescreen

In [4]:
# keep best pair only (8 thread)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, rescreen=true);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing reference haplotype data...


Importing genotype file...100%|█████████████████████████| Time: 0:00:06
Phasing chunk 1/1...100%|███████████████████████████████| Time: 0:00:14
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.07889 seconds
    Computing haplotype pair        = 14.6762 seconds
        BLAS3 mul! to get M and N      = 0.0546491 seconds per thread
        haplopair search               = 1.54516 seconds per thread
        min least sq on observed data  = 0.0497179 seconds per thread
        finding redundant happairs     = 0.0706418 seconds per thread
    Phasing by win-win intersection = 0.289549 seconds
    Imputation                      = 6.7911 seconds

 29.900606 seconds (77.39 M allocations: 7.481 GiB, 4.44% gc time)
error_rate = 8.100052272752551e-5


# Screen for flanking windows

In [3]:
# keep best pair only (8 thread)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing reference haplotype data...


Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.69623 seconds
    Computing haplotype pair        = 3.37601 seconds
        BLAS3 mul! to get M and N      = 0.0963943 seconds per thread
        haplopair search               = 2.89449 seconds per thread
        finding redundant happairs     = 0.0482441 seconds per thread
    Phasing by win-win intersection = 0.447916 seconds
    Imputation                      = 6.62697 seconds

 19.876269 seconds (77.16 M allocations: 7.417 GiB, 5.39% gc time)
error_rate = 7.657402155416903e-5


# Haplotype thinning

In [None]:
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=100, max_haplotypes=100);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing reference haplotype data...


Importing genotype file...100%|█████████████████████████| Time: 0:00:06


In [5]:
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=100, max_haplotypes=100);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing reference haplotype data...


Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.57389 seconds
    Computing haplotype pair        = 3.81197 seconds
        screening for top haplotypes   = 0.738086 seconds per thread
        BLAS3 mul! to get M and N      = 2.67268 seconds per thread
        haplopair search               = 0.164605 seconds per thread
        initializing missing           = 0.166272 seconds per thread
        allocating and viewing         = 0.0125657 seconds per thread
        index conversion               = 0.00055434 seconds per thread
    Phasing by win-win intersection = 0.0602829 seconds
        Window-by-window intersection  = 0.0470093 seconds per thread
        Breakpoint search              = 0.00641138 seconds per thread
        Recording result               = 0.00341723 seconds per thread
    Imputation                      = 7.06285 seconds

 19.509919 seconds (76.67 M allocations: 6.657 GiB, 5.16% gc time)
error_rat
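The "screening for top haplotypes" timing suggests that thinning first ranks haplotypes by their distance to the genotype vector and then searches pairs only among the closest ones. A minimal sketch of that screening step on toy data, assuming this reading is right; `k` plays the role of the number of haplotypes kept and is not claimed to match the exact meaning of `thinning_factor` or `max_haplotypes`.

```julia
using Random

Random.seed!(2020)
p, d, k = 40, 12, 4                     # toy sizes: SNPs, haplotypes, haplotypes kept
H = rand(Bool, p, d)                    # one reference haplotype per column
x = H[:, 3] .+ H[:, 9]                  # toy genotype: sum of haplotypes 3 and 9

# Squared distance from the genotype to each single haplotype.
dists = [sum(abs2, x .- view(H, :, j)) for j in 1:d]

# Indices of the k closest haplotypes, in increasing order of distance.
keep   = partialsortperm(dists, 1:k)
H_thin = H[:, keep]                     # the pair search would then run on this subset

println("kept haplotypes = ", collect(keep))
println("their distances = ", dists[keep])
```

The pair search over `H_thin` then costs on the order of `k^2` instead of `d^2` comparisons per sample, at the risk of discarding a true haplotype that happened to rank below the cutoff.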

# Try stepwise search heuristic

In [6]:
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, max_haplotypes=100, stepwise=100);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing reference haplotype data...


Importing genotype file...100%|█████████████████████████| Time: 0:00:06
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 7.93711 seconds
    Computing haplotype pair        = 1.14714 seconds
        BLAS3 mul! to get M and N      = 0.073336 seconds per thread
        haplopair search               = 0.881813 seconds per thread
        initializing missing           = 0.126261 seconds per thread
        allocating and viewing         = 0.0130243 seconds per thread
        index conversion               = 0.000442667 seconds per thread
    Phasing by win-win intersection = 0.051122 seconds
        Window-by-window intersection  = 0.0405628 seconds per thread
        Breakpoint search              = 0.00488876 seconds per thread
        Recording result               = 0.00270064 seconds per thread
    Imputation                      = 6.45776 seconds

 15.594449 seconds (76.67 M allocations: 6.378 GiB, 5.74% gc time)
error_rate = 8.563833928352964e-5


# Beagle 5.1

In [None]:
# convert to bref3 (run in terminal)
java -jar ../bref3.18May20.d20.jar ref.excludeTarget.vcf.gz > ref.excludeTarget.bref3 

In [31]:
# run beagle 5 (8 thread)
run(`java -jar beagle.18May20.d20.jar gt=compare2/target.typedOnly.maf0.01.masked.vcf.gz ref=compare2/ref.excludeTarget.bref3 out=compare2/beagle.result nthreads=8`)

# beagle 5 error rate
X_complete = convert_gt(Float32, "compare2/target.full.vcf.gz")
n, p = size(X_complete)
X_beagle = convert_gt(Float32, "compare2/beagle.result.vcf.gz")
error_rate = sum(X_beagle .!= X_complete) / n / p

beagle.18May20.d20.jar (version 5.1)
Copyright (C) 2014-2018 Brian L. Browning
Enter "java -jar beagle.18May20.d20.jar" to list command line argument
Start time: 01:01 AM PDT on 30 Jun 2020

Command line: java -Xmx3641m -jar beagle.18May20.d20.jar
  gt=compare2/target.typedOnly.maf0.01.masked.vcf.gz
  ref=compare2/ref.excludeTarget.bref3
  out=compare2/beagle.result
  nthreads=8

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:      19,000
Study samples:           1,000

Window 1 (1:34-9999816)
Reference markers:      89,913
Study markers:          36,874

Burnin  iteration 1:           36 seconds
Burnin  iteration 2:           32 seconds
Burnin  iteration 3:           35 seconds
Burnin  iteration 4:           43 seconds
Burnin  iteration 5:           37 seconds
Burnin  iteration 6:           1 minute 16 seconds

Phasing iteration 1:           2 minutes 0 seconds
Phasing iteration 2:           41 seconds
Phasing iteration 3:           38 seconds
Phasing iteration 4:  

1.7794979591382782e-5
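To put the numbers side by side, the snippet below collects the error rates printed in the sections above and ranks them. Only values that appear in this notebook's output are used; the labels are shorthand chosen here.

```julia
# Error rates copied from the outputs above.
error_rates = [
    "MendelImpute, dynamic programming"       => 0.0003560219323123464,
    "MendelImpute, intersection (impute)"     => 7.199181430938797e-5,
    "MendelImpute, intersection (phase=true)" => 9.775004726791454e-5,
    "MendelImpute, rescreen"                  => 8.100052272752551e-5,
    "MendelImpute, flanking windows"          => 7.657402155416903e-5,
    "MendelImpute, stepwise"                  => 8.563833928352964e-5,
    "Beagle 5.1"                              => 1.7794979591382782e-5,
]

ranked = sort(error_rates, by = last)
beagle = last(first(filter(x -> first(x) == "Beagle 5.1", error_rates)))

for (method, err) in ranked
    println(rpad(method, 42), " error = ", err, "  (", round(err / beagle, digits = 1), "x Beagle)")
end
```

On these runs Beagle 5.1 has the lowest error rate, with the best MendelImpute variant at roughly four times that value.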

# Minimac4 error

The reference VCF file must first be converted to m3vcf format using Minimac3 (run on Hoffman):

```Julia
minimac3 = "/u/home/b/biona001/haplotype_comparisons/minimac3/Minimac3/bin/Minimac3"
@time run(`$minimac3 --refHaps haplo_ref.vcf.gz --processReference --prefix haplo_ref`)
```

In [None]:
# use eagle 2.4 for prephasing

In [None]:
# run minimac 4
minimac4 = "/Users/biona001/Benjamin_Folder/UCLA/research/softwares/Minimac4/build/minimac4"
run(`$minimac4 --refHaps haplo_ref.m3vcf.gz --haps target_masked.vcf.gz --prefix minimac4.result`)
    
X_minimac = convert_gt(Float32, "minimac4.result.dose.vcf.gz", as_minorallele=false)
error_rate = sum(X_minimac .!= X_complete) / n / p

# BLAS 3

In [2]:
# account for allele frequency (8 thread, BLAS3)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=2000, thinning_scale_allelefreq=false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:19
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 11.2299 seconds
    Computing haplotype pair        = 20.639 seconds
        computing dist(X, H)           = 0.10026 seconds per thread
        BLAS3 mul! to get M and N      = 0.0936877 seconds per thread
        haplopair search               = 13.9145 seconds per thread
        finding redundant happairs     = 0.0360827 seconds per thread
    Phasing by win-win intersection = 1.08454 seconds
    Imputation                      = 8.11408 seconds

 59.805070 seconds (161.01 M allocations: 12.053 GiB, 4.43% gc time)
error_rate = 8.47485903039605e-5
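The timing line "BLAS3 mul! to get M and N" refers to matrix products that feed the haplotype pair search. The sketch below shows the underlying idea on toy data, under the assumption that the search minimizes the squared residual between a genotype and the sum of two haplotypes; the names `HtH` and `XtH` are chosen here and need not match how MendelImpute defines or scales its `M` and `N`.

```julia
using LinearAlgebra, Random

Random.seed!(2020)
p, d, n = 30, 6, 4                            # toy sizes: SNPs, haplotypes, samples
H = Float32.(rand(Bool, p, d))                # one reference haplotype per column
X = H[:, rand(1:d, n)] .+ H[:, rand(1:d, n)]  # each genotype is a sum of two haplotypes

# Two BLAS level-3 products give every inner product the search needs.
HtH = Matrix{Float32}(undef, d, d)
XtH = Matrix{Float32}(undef, n, d)
mul!(HtH, H', H)                              # HtH[i, j] = h_i' h_j
mul!(XtH, X', H)                              # XtH[s, i] = x_s' h_i

# Exhaustive search over pairs i ≤ j for sample s.
# score = ||x_s - h_i - h_j||^2 - ||x_s||^2, so the minimizer of score
# is also the minimizer of the residual.
function best_pair(HtH, XtH, s)
    best, bi, bj = Inf32, 0, 0
    for j in axes(HtH, 2), i in 1:j
        score = HtH[i, i] + HtH[j, j] + 2 * HtH[i, j] - 2 * XtH[s, i] - 2 * XtH[s, j]
        if score < best
            best, bi, bj = score, i, j
        end
    end
    return bi, bj, best
end

for s in 1:n
    i, j, score = best_pair(HtH, XtH, s)
    residual = score + sum(abs2, view(X, :, s))   # add back the constant term
    println("sample $s: haplotypes ($i, $j), residual = $residual")
end
```

Because every toy genotype is an exact sum of two reference haplotypes, the recovered residual is zero for each sample. Computing the products once for all samples is what makes the per-pair score a handful of lookups.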


# BLAS 2

In [None]:
# account for allele frequency (8 thread, BLAS2)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=2000, thinning_scale_allelefreq=false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Computing optimal haplotype pairs... 71%|██████████▋    |  ETA: 0:01:21