# Test imputation on untyped SNPs, chromosome 20

In [1]:
using Revise
using VCFTools
using MendelImpute
using GeneticVariation
using Random
using StatsBase

└ @ Revise /Users/biona001/.julia/packages/Revise/439di/src/Revise.jl:1108
┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1273


### Memory requirement

**Prephasing step:**
+ Target data requires $people \times snps \times 4$ bytes of RAM
+ Reference haplotype data requires $haplotypes \times snps$ bits of RAM
+ Redundant haplotype set for imputation target requires roughly $people \times windows \times 1000 \times 16$ bytes of RAM, where 1000 is the maximum number of haplotypes per window
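
As a rough worked example of the three formulas above, the sketch below plugs in the sizes used in this notebook (100 target samples, 2404 reference samples, 379432 typed SNPs). The helper name `prephase_memory`, the choice of `width = 500`, and the assumption that the reference is stored only at the typed SNPs during prephasing are illustrative assumptions, not part of MendelImpute.

```julia
# Rough RAM estimates for the prephasing step, following the three formulas above.
# `max_haps` is the "1000 (max haplotypes per window)" bound quoted above;
# `prephase_memory` is a hypothetical helper, not a MendelImpute function.
function prephase_memory(people, haplotypes, snps, width; max_haps = 1000)
    windows = cld(snps, width)                   # number of windows, rounded up
    target_bytes = people * snps * 4             # 4 bytes per genotype entry
    ref_bytes = cld(haplotypes * snps, 8)        # 1 bit per haplotype entry
    redundant_bytes = people * windows * max_haps * 16
    return (target = target_bytes, reference = ref_bytes, redundant = redundant_bytes)
end

# 2404 reference samples give 2 * 2404 haplotypes
mem = prephase_memory(100, 2 * 2404, 379432, 500)
for (name, bytes) in pairs(mem)
    println(rpad(name, 10), round(bytes / 2^30, digits = 3), " GiB")
end
```

Under these assumptions the redundant haplotype set is the largest of the three terms, so it may be the first thing to watch when the number of target samples grows.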

## Generate subset of markers for prephasing

In [3]:
cd("/Users/biona001/.julia/dev/MendelImpute/data/1000_genome_phase3_v5/filtered")
function filter_and_mask()
    for chr in [20]
        # filter chromosome data for unique snps
        data = "../beagle_raw/chr$chr.1kg.phase3.v5a.vcf.gz"
        full_record_index = .!find_duplicate_marker(data)
        @time VCFTools.filter(data, full_record_index, 1:nsamples(data), 
            des = "chr$chr.uniqueSNPs.vcf.gz")

        # summarize data
        total_snps, samples, _, _, _, maf_by_record, _ = gtstats("chr$chr.uniqueSNPs.vcf.gz")
        large_maf = findall(x -> x > 0.005, maf_by_record)  

        # generate target file with 100 samples and keep snps with maf>0.005 as typed SNPs
        n = 100
        p = length(large_maf)
        record_idx = falses(total_snps)
        record_idx[large_maf] .= true
        sample_idx = falses(samples)
        sample_idx[1:n] .= true
        shuffle!(sample_idx)
        @time VCFTools.filter("chr$chr.uniqueSNPs.vcf.gz", record_idx, sample_idx, 
            des = "target.chr$chr.typedOnly.vcf.gz")

        # generate target panel with all snps
        @time VCFTools.filter("chr$chr.uniqueSNPs.vcf.gz", 
            1:total_snps, sample_idx, des = "target.chr$chr.full.vcf.gz")

        # also generate reference panel without target samples
        @time VCFTools.filter("chr$chr.uniqueSNPs.vcf.gz", 
            1:total_snps, .!sample_idx, des = "ref.chr$chr.excludeTarget.vcf.gz")

        # unphase and mask 0.1% of entries in target file
        masks = falses(p, n)
        missingprop = 0.001
        for j in 1:n, i in 1:p
            rand() < missingprop && (masks[i, j] = true)
        end
        @time mask_gt("target.chr$chr.typedOnly.vcf.gz", masks, 
            des="target.chr$chr.typedOnly.masked.vcf.gz", unphase=true)

        # generate subset of reference file that matches target file
        @time conformgt_by_pos("ref.chr$chr.excludeTarget.vcf.gz", 
            "target.chr$chr.typedOnly.masked.vcf.gz", 
            "chr$chr.aligned", "$chr", 1:typemax(Int))
        if nrecords("chr$chr.aligned.tgt.vcf.gz") == p
            rm("chr$chr.aligned.tgt.vcf.gz", force=true) # perfect match
        else
            error("target file has SNPs not matching in reference file! Shouldn't happen!")
        end
        mv("chr$chr.aligned.ref.vcf.gz", "ref.chr$chr.aligned.vcf.gz", force=true)
    end 
end
Random.seed!(2020)
@time filter_and_mask()

634.791378 seconds (5.12 G allocations: 482.584 GiB, 6.73% gc time)


Progress: 100%|█████████████████████████████████████████| Time: 0:09:39


458.821619 seconds (5.30 G allocations: 502.982 GiB, 10.16% gc time)
462.232565 seconds (5.45 G allocations: 517.732 GiB, 10.34% gc time)
1071.281820 seconds (13.27 G allocations: 1016.904 GiB, 11.08% gc time)
 19.325729 seconds (119.00 M allocations: 12.223 GiB, 5.94% gc time)


┌ Info: Match target POS to reference POS
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:172
Progress: 100%|█████████████████████████████████████████| Time: 0:19:40


1207.640069 seconds (15.55 G allocations: 1.374 TiB, 14.84% gc time)


┌ Info: 379432 records are matched
└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/conformgt.jl:239


4887.106827 seconds (58.46 G allocations: 5.088 TiB, 11.21% gc time)


### Missing rate

Among typed markers, 0.1% of entries are masked at random. In addition, 44% of all markers are not typed (i.e. systematically missing), as computed below.
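
The random masking can be sketched on a smaller matrix as follows. This is a minimal illustration, assuming each entry is masked independently with probability `missingprop`, as in the loop inside `filter_and_mask`. The helper `random_mask` and the toy dimensions are hypothetical.

```julia
using Random

# Build a p × n Boolean mask in which each entry is masked independently
# with probability `missingprop`. `random_mask` is a hypothetical helper.
function random_mask(rng::AbstractRNG, p::Int, n::Int, missingprop::Float64)
    return rand(rng, p, n) .< missingprop   # returns a BitMatrix
end

rng = MersenneTwister(2020)
masks = random_mask(rng, 10_000, 100, 0.001)   # toy size, far smaller than chr20
println("empirical missing rate = ", count(masks) / length(masks))
```

The empirical rate should land close to 0.001, which is a quick sanity check that the mask matches the intended missing proportion.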

In [5]:
tgtfile = "target.chr20.typedOnly.masked.vcf.gz"
reffile = "ref.chr20.excludeTarget.vcf.gz"
missing_rate = 1 - nrecords(tgtfile) / nrecords(reffile)

0.44082516280872497
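
For reference, the same arithmetic can be written out with the record counts reported by `nrecords` elsewhere in this notebook. The sketch below also combines the systematic and random missingness into one expected fraction of unobserved genotypes. That combination is an illustrative calculation, not a quantity the notebook itself reports.

```julia
typed_snps = 379432    # records in target.chr20.typedOnly.masked.vcf.gz
total_snps = 678557    # records in ref.chr20.excludeTarget.vcf.gz
missingprop = 0.001    # random masking rate among typed markers

typed_fraction = typed_snps / total_snps
untyped_fraction = 1 - typed_fraction

# Expected fraction of all genotype entries that must be imputed:
# every untyped marker, plus the randomly masked entries of typed markers.
expected_unobserved = untyped_fraction + typed_fraction * missingprop
println("untyped fraction    = ", untyped_fraction)
println("expected unobserved = ", expected_unobserved)
```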

# MendelImpute on untyped markers with dynamic programming (dp)

In [2]:
Threads.nthreads()

8

In [7]:
tgtfile = "target.chr20.typedOnly.masked.vcf.gz"
reffile = "ref.chr20.excludeTarget.vcf.gz"
reffile_aligned = "ref.chr20.aligned.vcf.gz"
X_typedOnly_complete = "target.chr20.typedOnly.vcf.gz"
X_full_complete = "target.chr20.full.vcf.gz"
@show nrecords(tgtfile), nsamples(tgtfile)
@show nrecords(reffile), nsamples(reffile)
@show nrecords(reffile_aligned), nsamples(reffile_aligned)
@show nrecords(X_typedOnly_complete), nsamples(X_typedOnly_complete)
@show nrecords(X_full_complete), nsamples(X_full_complete)

(nrecords(tgtfile), nsamples(tgtfile)) = (379432, 100)
(nrecords(reffile), nsamples(reffile)) = (678557, 2404)
(nrecords(reffile_aligned), nsamples(reffile_aligned)) = (379432, 2404)
(nrecords(X_typedOnly_complete), nsamples(X_typedOnly_complete)) = (379432, 100)
(nrecords(X_full_complete), nsamples(X_full_complete)) = (678557, 100)


(678557, 100)

In [None]:
# ad-hoc dp method, keep pairs within 3 of best pair, keep all pairs minimizing diff w/ observed error
cd("/Users/biona001/.julia/dev/MendelImpute/data/1000_genome_phase3_v5/filtered")
Random.seed!(2020)
function run()
#     X_complete = convert_gt(Float32, "target.chr20.typedOnly.vcf.gz")
    X_complete = convert_gt(Float32, "target.chr20.full.vcf.gz")
    n, p = size(X_complete)
    chr = 20
    for width in [250, 500, 1000]
        println("Imputing typed + untyped SNPs with dynamic programming, width = $width")
        tgtfile = "target.chr$chr.typedOnly.masked.vcf.gz"
        reffile = "ref.chr$chr.excludeTarget.vcf.gz"
        outfile = "mendel.imputed.dp$width.vcf.gz"
        @time phase(tgtfile, reffile, outfile=outfile, impute=true, width=width, 
            fast_method=false)
        X_mendel = convert_gt(Float32, outfile)
        println("error overall = $(sum(X_mendel .!= X_complete) / n / p) \n")
    end
end
run()

Imputing typed + untyped SNPs with dynamic programming, width = 250


Importing genotype file...100%|█████████████████████████| Time: 0:00:09
Importing reference haplotype files...100%|█████████████| Time: 0:03:31
Computing optimal haplotype pairs...100%|███████████████| Time: 0:04:39
Merging breakpoints...100%|█████████████████████████████| Time: 0:00:40
Writing to file...100%|█████████████████████████████████| Time: 0:00:11


602.610112 seconds (5.53 G allocations: 415.524 GiB, 16.03% gc time)
error overall = 0.003854178794117517 

Imputing typed + untyped SNPs with dynamic programming, width = 500


Importing genotype file...100%|█████████████████████████| Time: 0:00:16
Importing reference haplotype files...100%|█████████████| Time: 0:05:46
Computing optimal haplotype pairs...100%|███████████████| Time: 0:08:03
Merging breakpoints...100%|█████████████████████████████| Time: 0:02:36
Writing to file...100%|█████████████████████████████████| Time: 0:00:11


1062.011519 seconds (7.55 G allocations: 417.936 GiB, 19.28% gc time)
error overall = 0.003793388027829645 

Imputing typed + untyped SNPs with dynamic programming, width = 1000


Importing genotype file...100%|█████████████████████████| Time: 0:00:09
Importing reference haplotype files...100%|█████████████| Time: 0:05:36
Computing optimal haplotype pairs...100%|███████████████| Time: 0:10:10
Merging breakpoints... 59%|█████████████████▏           |  ETA: 0:03:24

In [7]:
chr = 20
width = 400
tgtfile = "target.chr$chr.typedOnly.masked.vcf.gz"
reffile = "ref.chr$chr.excludeTarget.vcf.gz"
outfile = "mendel.imputed.dp$width.vcf.gz"

X, X_sampleID, X_chr, X_pos, X_ids, X_ref, X_alt = convert_gt(Float32, tgtfile, trans=true, save_snp_info=true, msg = "Importing genotype file...")
H, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt = convert_ht(Bool, reffile, trans=true, save_snp_info=true, msg = "Importing reference haplotype files...")

# match target and ref file by snp position
@time XtoH_idx = indexin(X_pos, H_pos) # X_pos[i] == H_pos[XtoH_idx[i]]
H_aligned     = @view(H[XtoH_idx, :])
H_aligned_chr = @view(H_chr[XtoH_idx])
H_aligned_pos = @view(H_pos[XtoH_idx])
H_aligned_ids = @view(H_ids[XtoH_idx])
H_aligned_ref = @view(H_ref[XtoH_idx])
H_aligned_alt = @view(H_alt[XtoH_idx])

# declare some constants
people = size(X, 2)
haplotypes = size(H, 2)
tgt_snps = size(X, 1)

hs = compute_optimal_halotype_set(X, H_aligned, width = width, fast_method=false)

Importing genotype file...100%|█████████████████████████| Time: 0:00:17
Importing reference haplotype files...100%|█████████████| Time: 0:05:45


  0.098114 seconds (56 allocations: 34.427 MiB, 26.01% gc time)


Computing optimal haplotype pairs...100%|███████████████| Time: 0:06:54


UndefVarError: UndefVarError: flankwidth not defined
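
The position-matching step in the cell above relies on `indexin`, which returns `nothing` for any target position that is absent from the reference. Indexing an array with a vector that contains `nothing` raises an `ArgumentError`, so a guard such as the one below may be useful before building the `@view`s. This is a minimal sketch on toy positions. The vectors here are made up and only mirror the names `X_pos` and `H_pos`.

```julia
X_pos = [101, 205, 309]              # toy target positions (hypothetical)
H_pos = [50, 101, 150, 205, 260]     # toy reference positions (hypothetical)

idx = indexin(X_pos, H_pos)          # Vector{Union{Nothing, Int}}
unmatched = findall(isnothing, idx)  # target SNPs absent from the reference

if isempty(unmatched)
    # Safe to index: every target position was found in the reference
    XtoH_idx = convert(Vector{Int}, idx)
    @assert H_pos[XtoH_idx] == X_pos
else
    println("target positions missing from reference: ", X_pos[unmatched])
end
```

In this toy example the third target position has no match, so the guard reports it instead of letting a later indexing call fail.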

In [8]:
ph = [HaplotypeMosaicPair(tgt_snps) for i in 1:people] # phase information
phase!(ph, X, H_aligned, hs, width=width)

X_full = Matrix{Union{Missing, Float32}}(missing, size(H, 1), people)
copyto!(@view(X_full[XtoH_idx, :]), X)

# convert phase's starting position from X's index to H's index
@time update_marker_position!(ph, XtoH_idx, X_pos, H_pos)

Merging breakpoints...100%|█████████████████████████████| Time: 0:01:38


  0.000132 seconds (5 allocations: 208 bytes)


In [9]:
X_full

678557×100 Array{Union{Missing, Float32},2}:
  missing   missing   missing   missing  …   missing   missing   missing
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
  missing   missing   missing   missing      missing   missing   missing
  missing   missing   missing   missing      missing   missing   missing
 1.0       1.0       1.0       1.0          2.0       2.0       1.0     
 1.0       1.0       1.0       0.0       …  1.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       2.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
  missing   missing   missing   missing      missing   missing   missing
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
  missing   missing   missing   missing  …   missing   missing   missing
 1.0       1.0       1.0       1.0          2.0       2.0       1.0     
  missing   missing   missing   missing      missing   missing   missing
 ⋮    

In [11]:
impute_untyped!(X_full, H_aligned, ph, outfile, X_sampleID, H_pos, H_chr, H_ids, H_ref, H_alt)

ArgumentError: ArgumentError: invalid index: nothing of type Nothing

In [18]:
X_full

678557×100 Array{Union{Missing, Float32},2}:
 0.0       0.0       0.0       0.0       …  0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 1.0       2.0       1.0       1.0          1.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       2.0       0.0     
 1.0       1.0       1.0       1.0          2.0       2.0       1.0     
 1.0       1.0       1.0       0.0       …  1.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       2.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       2.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0       …  1.0       0.0       0.0     
 1.0       1.0       1.0       1.0          2.0       2.0       1.0     
 0.0       0.0       1.0       1.0          0.0       2.0       0.0     
 ⋮    

In [19]:
X_copy = deepcopy(X_full)
p, n = size(X_copy)
for snp in 1:p, person in 1:n
    if ismissing(X_copy[snp, person])
        #find where snp is located in phase
        hap1_position = searchsortedlast(ph[person].strand1.start, snp)
        hap2_position = searchsortedlast(ph[person].strand2.start, snp)

        #find the correct haplotypes 
        hap1 = ph[person].strand1.haplotypelabel[hap1_position]
        hap2 = ph[person].strand2.haplotypelabel[hap2_position]

        # imputation step 
        X_copy[snp, person] = H[snp, hap1] + H[snp, hap2]
    end
end
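
The loop above can be illustrated on a toy example. The sketch below assumes, as the loop does, that each strand stores sorted segment start positions and the haplotype label copied on each segment. `searchsortedlast` then finds the segment covering a given SNP. The type `ToyStrand` and helper `haplotype_at` are hypothetical stand-ins, not MendelImpute types.

```julia
# Toy reference panel: 6 SNPs × 3 haplotypes (true = alternate allele)
H = Bool[0 1 1;
         1 0 1;
         0 0 1;
         1 1 0;
         0 1 0;
         1 0 0]

# One strand of a mosaic: segment k starts at SNP start[k] and copies haplotype label[k].
# Hypothetical stand-in for the strand fields used in the loop above.
struct ToyStrand
    start::Vector{Int}
    label::Vector{Int}
end

# Haplotype copied by `strand` at SNP index `snp`
function haplotype_at(strand::ToyStrand, snp::Int)
    k = searchsortedlast(strand.start, snp)  # last segment starting at or before snp
    return strand.label[k]
end

strand1 = ToyStrand([1, 4], [1, 3])   # haplotype 1 on SNPs 1-3, haplotype 3 on SNPs 4-6
strand2 = ToyStrand([1, 3], [2, 2])   # haplotype 2 on every segment

# Genotypes of one person; missing entries are untyped or masked SNPs
x = Union{Missing, Float32}[missing, 1, missing, 1, missing, 0]
for snp in eachindex(x)
    if ismissing(x[snp])
        # imputed dosage = sum of the two copied haplotype alleles
        x[snp] = H[snp, haplotype_at(strand1, snp)] + H[snp, haplotype_at(strand2, snp)]
    end
end
println(x)
```

Because the segment starts are sorted, each lookup is a binary search, which keeps the per-entry cost low even when a strand has many breakpoints.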

In [3]:
# ad-hoc dp method, keep pairs within 3 of best pair, keep all pairs minimizing diff w/ observed error
cd("/Users/biona001/.julia/dev/MendelImpute/data/1000_genome_phase3_v5/filtered")
Random.seed!(2020)
function run()
#     X_complete = convert_gt(Float32, "target.chr20.typedOnly.vcf.gz")
    X_complete = convert_gt(Float32, "target.chr20.full.vcf.gz")
    n, p = size(X_complete)
    chr = 20
    for width in [250, 500, 1000]
        println("Imputing typed + untyped SNPs with dynamic programming, width = $width")
        tgtfile = "target.chr$chr.typedOnly.masked.vcf.gz"
        reffile = "ref.chr$chr.excludeTarget.vcf.gz"
        outfile = "mendel.imputed.dp$width.vcf.gz"
        reffile_aligned = "ref.chr$chr.aligned.vcf.gz"
        @time phase(tgtfile, reffile, reffile_aligned = reffile_aligned, impute=true, 
            outfile = outfile, width = width, fast_method=false)
        X_mendel = convert_gt(Float32, outfile)
        println("error overall = $(sum(X_mendel .!= X_complete) / n / p) \n")
    end
end
run()

Imputing typed + untyped SNPs with dynamic programming, width = 250
Running chunk 1 / 1


Importing genotype file...100%|█████████████████████████| Time: 0:00:08
Importing reference haplotype files...100%|█████████████| Time: 0:01:59
Computing optimal haplotype pairs...100%|███████████████| Time: 0:03:19
Merging breakpoints...100%|█████████████████████████████| Time: 0:00:36
Writing to file...100%|█████████████████████████████████| Time: 0:07:33


1096.865042 seconds (10.77 G allocations: 1.017 TiB, 9.56% gc time)
error overall = 0.0011072467014561784 

Imputing typed + untyped SNPs with dynamic programming, width = 500
Running chunk 1 / 1


Importing genotype file...100%|█████████████████████████| Time: 0:00:08
Importing reference haplotype files...100%|█████████████| Time: 0:01:58
Computing optimal haplotype pairs...100%|███████████████| Time: 0:03:39
Merging breakpoints...100%|█████████████████████████████| Time: 0:02:28
Writing to file...100%|█████████████████████████████████| Time: 0:06:40


1166.970957 seconds (10.59 G allocations: 998.882 GiB, 10.11% gc time)
error overall = 0.0010418726798190868 

Imputing typed + untyped SNPs with dynamic programming, width = 1000
Running chunk 1 / 1


Importing genotype file...100%|█████████████████████████| Time: 0:00:08
Importing reference haplotype files...100%|█████████████| Time: 0:01:58
Computing optimal haplotype pairs...100%|███████████████| Time: 0:03:09
Merging breakpoints...100%|█████████████████████████████| Time: 0:07:19
Writing to file...100%|█████████████████████████████████| Time: 0:06:47


1440.655884 seconds (10.54 G allocations: 996.445 GiB, 8.32% gc time)
error overall = 0.0011056256143551684 
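
The "error overall" figure above is the fraction of genotype entries that differ from the truth, pooled over typed and untyped SNPs. A possible refinement is to report the two groups separately. The sketch below does this on toy data, using the samples-by-SNPs layout implied by `n, p = size(X_complete)`. The helper `error_rate`, the toy matrices, and the `typed` indicator are all hypothetical.

```julia
# Fraction of genotype entries that differ between imputed and true matrices
error_rate(X_imputed, X_true) = count(X_imputed .!= X_true) / length(X_true)

# Toy data: 3 samples × 5 SNPs (hypothetical values)
X_true    = Float32[0 1 2 0 1; 1 1 0 0 2; 2 0 0 1 1]
X_imputed = Float32[0 1 2 0 1; 1 0 0 0 2; 2 0 0 1 0]
typed     = [true, false, true, true, false]   # hypothetical typed-SNP indicator

overall     = error_rate(X_imputed, X_true)
typed_err   = error_rate(X_imputed[:, typed], X_true[:, typed])
untyped_err = error_rate(X_imputed[:, .!typed], X_true[:, .!typed])
println("overall = $overall, typed = $typed_err, untyped = $untyped_err")
```

Separating the two rates makes it easier to see whether errors come from the randomly masked typed entries or from the systematically missing untyped markers.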



# Beagle 5.0

In [5]:
# beagle 5
cd("/Users/biona001/.julia/dev/MendelImpute/data/1000_genome_phase3_v5/filtered")
function beagle()
    chr = 20
    tgtfile = "target.chr$chr.typedOnly.masked.vcf.gz"
    reffile = "ref.chr$chr.excludeTarget.vcf.gz"
    outfile = "beagle.imputed"
    Base.run(`java -Xmx15g -jar beagle.28Sep18.793.jar gt=$tgtfile ref=$reffile out=$outfile nthreads=4`)
        
    # beagle error rate    
    X_complete = convert_gt(Float32, "target.chr$chr.full.vcf.gz")
    X_beagle = convert_gt(Float32, "beagle.imputed.vcf.gz")
    n, p = size(X_complete)
    println("error overall = $(sum(X_beagle .!= X_complete) / n / p) \n")
end
beagle()

beagle.28Sep18.793.jar (version 5.0)
Copyright (C) 2014-2018 Brian L. Browning
Enter "java -jar beagle.28Sep18.793.jar" to list command line argument
Start time: 07:14 PM PDT on 26 Apr 2020

Command line: java -Xmx13653m -jar beagle.28Sep18.793.jar
  gt=target.chr20.typedOnly.masked.vcf.gz
  ref=ref.chr20.excludeTarget.vcf.gz
  out=beagle.imputed
  nthreads=4

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:       2,404
Study samples:             100

Window 1 (20:60479-40060263)
Reference markers:     404,617
Study markers:         225,844

Burnin  iteration 1:           25 seconds
Burnin  iteration 2:           1 minute 18 seconds
Burnin  iteration 3:           48 seconds
Burnin  iteration 4:           46 seconds
Burnin  iteration 5:           50 seconds
Burnin  iteration 6:           50 seconds

Phasing iteration 1:           46 seconds
Phasing iteration 2:           45 seconds
Phasing iteration 3:           45 seconds
Phasing iteration 4:           45 seconds
Phas

# Eagle 2 + Minimac4

To use the reference panel in Eagle 2's prephasing option, the panel must first be converted to `.bcf` format, e.g. via `htslib`, which is *extremely* difficult to install. Even after all the work of producing the final `.bcf` reference file (see commands below), Eagle 2.4 still rejected it, reporting that the file is not bgzipped or that some processing error occurred. Therefore, I had no choice but to prephase without the reference panel.

In [None]:
# run eagle 2.4: 3367.79 sec on amd-2382 machine (can only run on linux systems)
eagle --vcf=target.chr20.typedOnly.masked.vcf.gz --outPrefix=eagle.phased.chr20 --numThreads=4 --geneticMapFile=../Eagle_v2.4.1/tables/genetic_map_hg19_withX.txt.gz

In [None]:
# convert ref file to m3vcf format (Total Run completed in 1 hours, 46 mins, 24 seconds)
/u/home/b/biona001/haplotype_comparisons/Minimac3/bin/Minimac3 --refHaps ref.chr20.excludeTarget.vcf.gz --processReference --prefix ref.chr20.excludeTarget

In [None]:
# run minimac4 (2619 seconds)
minimac4 --refHaps ref.chr20.excludeTarget.m3vcf.gz --haps eagle.phased.chr20.vcf.gz --prefix minimac.imputed.chr20 --format GT --cpus 4

In [None]:
# minimac4 error rate    
X_complete = convert_gt(Float32, "target.chr20.full.vcf.gz")
X_minimac = convert_gt(Float32, "minimac.imputed.chr20.dose.vcf.gz")
n, p = size(X_complete)
println("error overall = $(sum(X_minimac .!= X_complete) / n / p) \n")