# Tutorial on genotype imputation

This notebook showcases a few examples of the software [MendelImpute.jl](https://github.com/biona001/MendelImpute) (a work in progress).

## Package installation

In [None]:
# machine information for reproducibility
versioninfo()

In [None]:
]add https://github.com/biona001/MendelImpute

In [None]:
]add https://github.com/OpenMendel/VCFTools.jl

In [None]:
# load the necessary packages; install any that are missing with `]add PackageName`
using MendelImpute
using VCFTools
using GeneticVariation
using Random
using SparseArrays
using Plots

# Simulate sample data

We simulate 1000 reference haplotypes, each with 3000 SNPs, and then use this haplotype matrix to construct a genotype matrix with 100 samples and 3000 SNPs. Each person's genotype is divided into 1-5 chunks, and for each chunk 2 haplotypes are chosen with replacement to fill it. Finally, entries of the genotype matrix are masked at random to generate missing values.

The result is stored in 3 files:

+ **`haplo_ref.vcf.gz`**: haplotype reference panel
+ **`target.vcf.gz`**: complete genotype information
+ **`target_masked.vcf.gz`**: incomplete genotype information (about 1% of entries are missing, matching `missingprop = 0.01` in the code below)
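
The chunked construction described above can be sketched on a toy panel. This is a minimal illustration and not the code inside `simulate_genotypes`: the helper name `mosaic_genotype`, the SNPs × haplotypes layout of `H_toy`, and the uniform choice of breakpoints are all assumptions made for the example.

```julia
using Random

# Hypothetical helper (not part of MendelImpute): build one genotype vector as a
# mosaic of haplotype pairs drawn from the columns of H (assumed SNPs × haplotypes).
function mosaic_genotype(rng::AbstractRNG, H::AbstractMatrix{<:Integer}; max_chunks::Int = 5)
    snps, haps = size(H)
    nchunks = rand(rng, 1:max_chunks)
    # nchunks - 1 distinct interior breakpoints, sorted (needs nchunks <= snps)
    cuts = sort!(randperm(rng, snps - 1)[1:(nchunks - 1)])
    bounds = [0; cuts; snps]          # chunk k covers bounds[k]+1 : bounds[k+1]
    x = zeros(Int, snps)
    for k in 1:nchunks
        h1 = rand(rng, 1:haps)        # the two haplotypes are drawn with replacement,
        h2 = rand(rng, 1:haps)        # so h1 == h2 is allowed
        r = (bounds[k] + 1):bounds[k + 1]
        x[r] .= H[r, h1] .+ H[r, h2]  # genotype = sum of the two haplotype alleles
    end
    return x
end

rng   = MersenneTwister(1)
H_toy = rand(rng, 0:1, 20, 6)         # 20 SNPs, 6 reference haplotypes
x_toy = mosaic_genotype(rng, H_toy)   # one unphased genotype with entries in 0:2
```

Because each entry is the sum of two 0/1 alleles, the resulting vector carries no phase information, which is the kind of input the imputation step below expects.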

In [None]:
# 1000 haplotypes each with 3000 SNPs, 100 imputation targets
snps   = 3000
haps   = 1000
people = 100

# seed the global RNG before any random step so the run can be repeated
Random.seed!(2020)

# simulate full haplotype and genotype matrices
@time H = simulate_markov_haplotypes(snps, haps)
@time X = simulate_genotypes(H, people)

# randomly mask entries: each entry becomes missing with probability missingprop
missingprop = 0.01
@time Xm = ifelse.(rand(snps, people) .< missingprop, missing, X)

# write 3 files to disk (create the output directory first if it does not exist)
mkpath("./data")
@time make_refvcf_file(H, vcffilename="./data/haplo_ref.vcf.gz")
@time make_tgtvcf_file(X, vcffilename="./data/target.vcf.gz")
@time make_tgtvcf_file(Xm, vcffilename="./data/target_masked.vcf.gz")
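
The masking step can be checked in isolation on a toy matrix. The sketch below uses only Base Julia and the `Random` standard library; `X_toy` and `p_missing` are illustrative names standing in for the simulated genotype matrix and the masking probability.

```julia
using Random

Random.seed!(2020)
X_toy     = rand(0:2, 300, 100)   # toy genotype matrix (SNPs × people, assumed layout)
p_missing = 0.01                  # hypothetical name for the masking probability

# each entry is independently replaced by `missing` with probability p_missing
Xm_toy = ifelse.(rand(size(X_toy)...) .< p_missing, missing, X_toy)

# realized fraction of masked entries; it should be close to p_missing
observed_fraction = count(ismissing, Xm_toy) / length(Xm_toy)
println("realized missing fraction: ", observed_fraction)
```

The realized fraction fluctuates around the nominal probability, so a quick check like this is an easy way to confirm that the masked file contains the amount of missing data you intended.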

# Example: Import, manipulate, and visualize data

OpenMendel's `VCFTools.jl` allows us to import VCF files directly as a numeric matrix:

In [None]:
# read target data that needs imputation
Xm = convert_gt(Float64, "./data/target_masked.vcf.gz")

### Important notes:
+ Observed entries are 0, 1, or 2, so the target matrix **does not** have to be pre-phased!
+ **Dosage** data is supported natively!
+ Typical matrix operations work in the usual way!
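
To see why phasing does not matter for a 0/1/2 encoding, here is a small sketch that maps a single biallelic VCF `GT` string to an allele count. The helper `gt_to_count` is hypothetical and written only for this illustration; it is not how `VCFTools.jl` is implemented.

```julia
# Hypothetical helper for illustration only (not the VCFTools implementation).
# Converts one biallelic VCF GT string into a count of non-reference alleles.
function gt_to_count(gt::AbstractString)
    alleles = split(gt, r"[|/]")                 # "|" marks phased, "/" unphased
    any(a -> a == ".", alleles) && return missing
    return Float64(count(a -> a != "0", alleles))
end

println(gt_to_count("0|1"))   # phased heterozygote
println(gt_to_count("1/0"))   # unphased heterozygote, same count
println(gt_to_count("1|1"))   # homozygous for the alternate allele
println(gt_to_count("./."))   # missing genotype
```

Both heterozygous calls give the same count, so the separator (and therefore the phase) is discarded by this encoding.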

In [None]:
Xm[1:10, 1:10] * rand(10)

Visualize missingness (each dot is a missing value):

In [None]:
# put a random nonzero value at every missing position so that spy draws a dot there
Xm_missing_idx = randn(size(Xm)) .* sparse(ismissing.(Xm))
spy(Xm_missing_idx, size=(3000,200),
    title="Distribution of missing data (each dot is a missing entry; colors carry no meaning)",
    title_location=:left, markersize = 20)
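
A numeric summary can complement the plot. The sketch below computes missing rates on a toy matrix using Base Julia and the `Statistics` standard library; `X_toy` is an illustrative stand-in for `Xm`, and the people × SNPs layout is an assumption about the imported matrix.

```julia
using Statistics

# toy masked matrix; rows are assumed to be people and columns SNPs
X_toy = Union{Missing, Float64}[0 1 missing; 2 missing 1; 1 1 0]

missing_mask    = ismissing.(X_toy)
overall_rate    = mean(missing_mask)                  # fraction of all entries missing
per_snp_rate    = vec(mean(missing_mask, dims = 1))   # one value per column
per_sample_rate = vec(mean(missing_mask, dims = 2))   # one value per row

println("overall: ", overall_rate)
println("per SNP: ", per_snp_rate)
println("per sample: ", per_sample_rate)
```

Replacing `X_toy` with the imported matrix gives the same summaries for the real data.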

# Let's impute this matrix!

Run the following command:

In [None]:
# impute
tgtfile = "./data/target_masked.vcf.gz"
reffile = "./data/haplo_ref.vcf.gz"
outfile = "./data/imputed_target.vcf.gz"
width   = 400
@time phase(tgtfile, reffile, impute=true, outfile = outfile, width = width);
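
The source does not define `width`. Assuming it is the number of SNPs per window into which the markers are partitioned, the arithmetic for this example would look roughly like the sketch below. The variable names are illustrative, and how `phase` actually treats the final short window is not specified here.

```julia
# Illustration only: split SNP indices into consecutive windows of a fixed width.
# That `width` means "SNPs per window" is an assumption, not something stated above.
n_snps       = 3000
window_width = 400

windows = [i:min(i + window_width - 1, n_snps) for i in 1:window_width:n_snps]

println("number of windows: ", length(windows))
println("first window: ", first(windows))
println("last window: ", last(windows), " (", length(last(windows)), " SNPs)")
```

Under that assumption, 3000 SNPs with a width of 400 give seven full windows plus one shorter trailing window.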

The result can be written in either uncompressed `.vcf` or compressed `.vcf.gz` format.

## Import imputed result

In [None]:
X_mendel = convert_gt(Float64, outfile)

## Compare with true data

In [None]:
# import complete genotype info
X_true = convert_gt(Float64, "./data/target.vcf.gz")
# fraction of all genotype entries (observed and imputed) that differ from the truth
error_rate = sum(X_true .!= X_mendel) / snps / people
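
The error rate above is averaged over every entry, including those that were never masked. A stricter measure looks only at the entries that had to be imputed. The sketch below shows both on toy matrices; all names ending in `_toy` are illustrative and are not produced by the tutorial code.

```julia
# toy example: truth, masked input, and a hypothetical imputation result
X_true_toy    = [0.0 1.0 2.0; 1.0 1.0 0.0]
X_masked_toy  = Union{Missing, Float64}[0.0 missing 2.0; missing 1.0 0.0]
X_imputed_toy = [0.0 1.0 2.0; 2.0 1.0 0.0]

mask = ismissing.(X_masked_toy)    # positions that had to be imputed

# error over all entries, as in the cell above
overall_error = count(X_true_toy .!= X_imputed_toy) / length(X_true_toy)

# error restricted to the masked entries
masked_error = count(X_true_toy[mask] .!= X_imputed_toy[mask]) / count(mask)

println("overall error: ", overall_error)
println("error on masked entries: ", masked_error)
```

With the real data, `mask` would come from `ismissing.(Xm)`, provided `Xm` still holds the masked matrix imported earlier.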

## Visualize where our imputation went wrong

In [None]:
disagreeing_entries = randn(size(X_true)) .* sparse(X_true .!= X_mendel)
Plots.spy(disagreeing_entries, size=(3000,200),
    title="Visualize imputation error (dots = imputed incorrectly)",
    title_location=:left, markersize = 20)

# Conclusion & package features

+ Our pipeline supports importing, manipulating, and visualizing raw genotype data
+ Genotype data (imputation target) does **not** have to be pre-phased.
+ Genotype data (imputation target) can be dosage data!