# Genotype imputation and phasing

This notebook showcases our software [MendelImpute.jl](https://github.com/biona001/MendelImpute.jl), the fastest and most memory-efficient phasing/imputation software as of 2020. Our paper will be available on arXiv soon.

## Package installation

In [1]:
# Uncomment to add necessary packages
# using Pkg
# Pkg.add(PackageSpec(url="https://github.com/OpenMendel/SnpArrays.jl.git"))
# Pkg.add(PackageSpec(url="https://github.com/OpenMendel/VCFTools.jl.git"))
# Pkg.add(PackageSpec(url="https://github.com/OpenMendel/MendelImpute.jl.git"))
# Pkg.add(["GeneticVariation", "Random", "SparseArrays", "UnicodePlots"])

In [1]:
# Load necessary packages
using MendelImpute
using VCFTools
using GeneticVariation
using Random
using SparseArrays
using UnicodePlots

# What is imputation and phasing?

+ Making standard SNP array (GWAS) data more like sequence data by increasing the number of variants that are available for study
+ Inputs: 
    + Unphased (entries 0, 1, 2) genotypes at ~1 million markers 
    + A reference haplotype panel (entries 0, 1) at $\ge 40$ million markers
+ Output:
    + Phased genotypes at all markers (if genotype is 1, you know which chromosome it is on)
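
To make the difference between phased and unphased data concrete, here is a minimal toy sketch. The two haplotype vectors are made-up values for illustration, not data from this notebook.

```julia
# Two phased haplotypes for one sample at 6 markers (1 = alternate allele).
# These values are invented purely for illustration.
h1 = [0, 1, 1, 0, 0, 1]
h2 = [0, 0, 1, 1, 0, 1]

# The unphased genotype records only the allele count at each marker.
genotype = h1 .+ h2

# Heterozygous markers (genotype == 1) are exactly where the unphased data
# no longer tells us which chromosome carries the alternate allele.
ambiguous = findall(==(1), genotype)

println("genotype  = ", genotype)
println("ambiguous = ", ambiguous)
```

Phasing recovers `h1` and `h2` from `genotype`; imputation additionally fills in markers that were never typed.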
    
# Method overview

We take a data-driven approach to genotype imputation and phasing. For each target genotype, we select 2 haplotype segments from the reference panel that match it as closely as possible within a small genomic window. Numerous interacting tactics are summarized in the cartoon illustration below.



![Overview of MendelImpute's algorithm](method.png)



Overview of *MendelImpute*'s algorithm. (a) After alignment, imputation and phasing are carried out on short, non-overlapping windows of the typed SNPs. (b) Based on a least squares criterion, we find two unique haplotypes whose vector sum approximates the genotype vector on the current window. Once this is done, all reference haplotypes corresponding to these two unique haplotypes are assembled into two sets of candidate haplotypes. (c) We intersect candidate haplotype sets window by window, carrying along the surviving set and switching orientations if the result generates more surviving haplotypes. (d) After three windows, the top extended chromosome possesses no surviving haplotypes, but a switch to the second orientation in the current window allows $h_5$ to survive on the top chromosome. Eventually we must search for a break point separating $h_1$ from $h_2$ or $h_6$ between windows 3 and 4 (bottom panel).
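
The least squares criterion in step (b) can be illustrated with a brute-force sketch. This is not MendelImpute's actual implementation: the function name `best_haplotype_pair` and the toy matrix `H` are assumptions made for illustration. The sketch expands $\|x - h_i - h_j\|^2$ so that only the inner products $M = H^T H$ and $N = H^T x$ are needed.

```julia
using LinearAlgebra

# Hypothetical helper (not part of MendelImpute): exhaustively search for the
# pair of columns of H whose sum is closest to x in squared Euclidean distance.
function best_haplotype_pair(x::AbstractVector, H::AbstractMatrix)
    M = H' * H           # M[i, j] = dot(h_i, h_j)
    N = H' * x           # N[i]    = dot(x, h_i)
    xx = dot(x, x)
    best_i, best_j, best_dist = 0, 0, Inf
    d = size(H, 2)
    for j in 1:d, i in 1:j
        # ||x - h_i - h_j||^2 written in terms of precomputed inner products
        dist = xx + M[i, i] + M[j, j] + 2 * M[i, j] - 2 * N[i] - 2 * N[j]
        if dist < best_dist
            best_i, best_j, best_dist = i, j, dist
        end
    end
    return best_i, best_j, best_dist
end

# Toy window: 5 typed SNPs (rows) and 4 unique haplotypes (columns).
H = [0 1 1 0;
     1 1 0 0;
     0 0 1 1;
     1 0 1 0;
     0 1 0 1]
x = H[:, 2] + H[:, 3]    # genotype built from haplotypes 2 and 3

i, j, dist = best_haplotype_pair(x, H)
println("best pair = ($i, $j), squared distance = $dist")
```

Because the loop visits every pair, the cost grows quadratically in the number of unique haplotypes per window, which is one reason to keep windows short.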

## Example data

Using [msprime](https://msprime.readthedocs.io/en/stable/) and our [VCFTools.jl](https://github.com/OpenMendel/VCFTools.jl) package, the `simulate_data.ipynb` notebook simulated data for this demonstration:

+ **`ref.excludeTarget.vcf.gz`**: haplotype reference panel with 2400 samples on 36063 SNPs.
+ **`target.typedOnly.masked.vcf.gz`**: unphased target genotypes with 100 samples, 5000 SNPs, and 1% of genotypes missing at random.
+ **`target.full.vcf.gz`**: complete genotype information (for checking imputation accuracy).

# Imputation and phasing

Now we illustrate how to use `MendelImpute.jl` for phasing and imputation. 

## Step 1: Generate `.jlso` compressed reference panel

MendelImpute requires one to pre-process the reference panel for faster reading. This is achieved via the [compress_haplotypes](https://openmendel.github.io/MendelImpute.jl/dev/man/api/#MendelImpute.compress_haplotypes) function.

In [2]:
reffile = "./data/ref.excludeTarget.vcf.gz"
tgtfile = "./data/target.typedOnly.masked.vcf.gz"
outfile = "./data/ref.excludeTarget.jlso"
@time compress_haplotypes(reffile, tgtfile, outfile)

importing reference data...100%|████████████████████████| Time: 0:00:13

 27.782844 seconds (212.40 M allocations: 14.952 GiB, 9.69% gc time)


## Step 2: Run MendelImpute!

Note: The first run is slower because it includes Julia's just-in-time compilation. Run twice for a more accurate timing result.
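
The first-run effect comes from Julia compiling a function the first time it is called. A minimal sketch with a stand-in workload shows the pattern; `sum_of_squares` is a hypothetical function chosen for illustration and has nothing to do with MendelImpute itself.

```julia
# Hypothetical stand-in workload; any function shows the same effect.
sum_of_squares(v) = sum(abs2, v)

v = rand(10^6)

t_first  = @elapsed sum_of_squares(v)   # includes compilation time
t_second = @elapsed sum_of_squares(v)   # runs already-compiled code

println("first call:  $t_first seconds")
println("second call: $t_second seconds")
```

The same idea applies to `phase`: time the second call if you want a number that reflects the algorithm rather than compilation.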


In [4]:
tgtfile = "./data/target.typedOnly.masked.vcf.gz"
reffile = "./data/ref.excludeTarget.jlso"
outfile = "./data/imputed.vcf.gz"
phase(tgtfile, reffile, outfile);

Number of threads = 1
Importing reference haplotype data...
Total windows = 16, averaging ~ 708 unique haplotypes per window.

Timings: 
    Data import                     = 0.585041 seconds
        import target data             = 0.384378 seconds
        import compressed haplotypes   = 0.200663 seconds
    Computing haplotype pair        = 0.298163 seconds
        BLAS3 mul! to get M and N      = 0.0282044 seconds per thread
        haplopair search               = 0.250691 seconds per thread
        initializing missing           = 0.00429557 seconds per thread
        allocating and viewing         = 0.0147918 seconds per thread
        index conversion               = 5.539e-5 seconds per thread
    Phasing by win-win intersection = 0.0617189 seconds
        Window-by-window intersection  = 0.00103244 seconds per thread
        Breakpoint search              = 0.0587399 seconds per thread
        Recording result               = 0.0017482 seconds per thread
    Imputation       

## Post-analysis: Import, manipulate, and visualize data

**Import:** OpenMendel's `VCFTools.jl` allows us to import VCF files directly as numeric matrices, so the data support all standard matrix operations and can be passed to any Julia package that accepts matrices as input. As a simple example, we calculate the error rate:

In [5]:
# import genotypes as double-precision matrices
Ximputed = convert_gt(Float64, "./data/imputed.vcf.gz")
Xtrue = convert_gt(Float64, "./data/target.full.vcf.gz")
err = sum(Xtrue .!= Ximputed) / size(Xtrue, 1) / size(Xtrue, 2)
println("error rate = $err")

error rate = 0.0007129190583146162
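
The matrices above have element type `Union{Missing, Float64}`, and in Julia a comparison involving `missing` yields `missing`. If either matrix still contained missing entries, the plain `sum` would therefore return `missing`. The sketch below, on small made-up matrices, shows one way to skip such entries and to break the error rate down per SNP; the toy values are assumptions for illustration only.

```julia
# Toy genotype matrices (rows = samples, columns = SNPs); values are invented.
Xtrue = Float64[0 1 2 0;
                1 1 0 2;
                2 0 1 1]
Ximputed = Matrix{Union{Missing, Float64}}(Xtrue)
Ximputed[2, 2] = 2.0       # one imputation error
Ximputed[3, 4] = missing   # one entry left missing

# Elementwise comparison; entries are `missing` wherever a genotype is missing.
mismatch = Xtrue .!= Ximputed

n_errors   = sum(skipmissing(mismatch))   # mismatches among comparable entries
n_compared = count(!ismissing, mismatch)  # entries that could be compared
err = n_errors / n_compared

# Per-SNP error rate, again ignoring missing comparisons.
per_snp = [sum(skipmissing(col)) / count(!ismissing, col) for col in eachcol(mismatch)]

println("overall error rate = $err")
println("per-SNP error rate = $per_snp")
```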


**Manipulate:** Genotype matrices permit all standard matrix operations (BLAS, LAPACK, etc.), so users gain access to statistical analysis *immediately* after imputation.

Here, we multiply the first 10 rows and 10 columns of the imputed matrix by a random vector, getting another random vector. 

In [6]:
Ximputed[1:10, 1:10] * rand(10)

10-element Array{Union{Missing, Float64},1}:
 1.0201571201911694
 0.9522113232558804
 1.6069507335118702
 1.6069507335118702
 1.4507189839267838
 1.2795810283838753
 1.13084507188393
 2.037512597247485
 1.6069507335118702
 1.6069507335118702

**Visualize**: Julia has a rich plotting ecosystem. One application is to check whether specific regions of the chromosome have more errors than others. Below we plot the imputation errors for the first 200 SNPs:

In [7]:
disagreeing_entries = sparse(Xtrue .!= Ximputed)
spy(disagreeing_entries[:, 1:200], title="Visualized imputation error")

[Figure: UnicodePlots `spy` plot titled "Visualized imputation error"]

In the figure, each row is a sample, each column is a SNP, and each dot represents an imputation error. Thus, in the top left panel, the line of dots (in sample 28) suggests that this sample is not imputed well in that region.
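
The same observation can be made numerically by summing the sparse mismatch matrix along each dimension. The sketch below uses a small made-up indicator matrix rather than the notebook's data, so the specific numbers are for illustration only.

```julia
using SparseArrays

# Toy mismatch indicator: 4 samples x 8 SNPs, `true` marks an imputation error.
errors = sparse(Bool[0 0 0 0 0 0 0 0;
                     0 1 1 1 1 0 0 0;
                     0 0 0 0 0 0 1 0;
                     0 0 0 0 0 0 0 0])

per_sample = vec(sum(errors, dims=2))   # number of errors in each row (sample)
per_snp    = vec(sum(errors, dims=1))   # number of errors in each column (SNP)

# Sample with the most errors, analogous to the line of dots in the plot.
worst_count, worst_sample = findmax(per_sample)

println("errors per sample = ", per_sample)
println("errors per SNP    = ", per_snp)
println("sample $worst_sample has the most errors ($worst_count)")
```

Applied to `disagreeing_entries`, the row sums would flag poorly imputed samples and the column sums would flag difficult SNPs.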

**Post-imputation quality control**: Naturally, we seek a post-imputation quality score for each SNP (or each sample). This is not demonstrated in our talk, but [our documentation](https://openmendel.github.io/MendelImpute.jl/dev/man/Phasing_and_Imputation/#Post-imputation:-per-SNP-Imputation-Quality-Score) details this process. 

# Conclusion 

+ OpenMendel offers a fast and intuitive pipeline for genotype imputation and phasing
+ After imputation, your data is immediately ready for statistical analysis and visualization

# Other package features

+ Built-in support for imputing genotypes stored in VCF files (.vcf, .vcf.gz) or PLINK files.
+ Out-of-the-box multithreaded (shared memory) parallelism.
+ Admixture estimation, with code examples to make pretty plots!
+ Ultra-compressed file for phased genotypes.
+ Imputation on dosage data.

**All code is on GitHub:** https://github.com/OpenMendel/MendelImpute.jl

**Latest documentation:** https://openmendel.github.io/MendelImpute.jl/dev/

**Our paper is going live on bioRxiv within a week.** Once it is up, we will update our GitHub page. Please stay tuned!