# Generating simulated haplotype files

This notebook describes how the genotype and haplotype reference files are generated.

In [1]:
using Revise
using CSV
using VCFTools
using Random
using DelimitedFiles
using MendelImpute
using Plots
using SparseArrays


### Step 0: Install `msprime`

[msprime installation instructions](https://msprime.readthedocs.io/en/stable/installation.html).

Some users may need to activate the conda base environment first with `conda config --set auto_activate_base True`. You can turn it off again once the simulation is done with `conda config --set auto_activate_base False`.


### Step 1: Simulate data in the terminal

```
python3 msprime_script.py 20000 10000 10000000 2e-8 2e-8 2019 > simulation.vcf
```

Arguments: 
+ Number of haplotypes = 20000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 10 million base pairs (same as Beagle 5's choice)
+ Recombination rate = 2e-8 (default)
+ Mutation rate = 2e-8 (default)
+ Seed = 2019
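
As a rough sanity check on these parameters, Watterson's formula gives the expected number of segregating sites (SNPs) under the standard neutral coalescent: E[S] = 4 · Ne · μ · L · Σ 1/i, with the sum running over i = 1, …, n − 1 for n sampled haplotypes. The sketch below assumes a constant population size and infinite-sites mutations; the function name is made up for illustration and is not part of `msprime` or `MendelImpute`.

```julia
# Expected number of segregating sites under the standard neutral coalescent
# (Watterson): E[S] = 4 * Ne * mu * L * sum_{i=1}^{n-1} 1/i.
# Assumes constant population size and infinite-sites mutations.
function expected_segregating_sites(n::Integer, Ne::Real, mu::Real, L::Real)
    harmonic = sum(1 / i for i in 1:(n - 1))   # harmonic number H_{n-1}
    return 4 * Ne * mu * L * harmonic
end

# Parameters from the msprime command above
expected_snps = expected_segregating_sites(20_000, 10_000, 2e-8, 10_000_000)
println("expected number of SNPs ≈ ", round(Int, expected_snps))
```

With the values listed above this evaluates to roughly 84,000 SNPs, which is in line with the 80,000 markers retained later in this notebook. The recombination rate does not enter the expectation; it only affects how the SNPs are correlated along the sequence.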

### Step 2: Convert simulated haplotypes to reference haplotypes and target genotype files (in .vcf format)

In the future, this should be done by `VCFTools.jl`. For now we brute-force it by editing the VCF file as a plain table.

The last 400 samples (800 haplotypes) are treated as imputation targets, and 10% of their entries are masked at random in Step 3. The remaining 9,600 samples (19,200 haplotypes) form the reference panel. The number of markers fluctuates between 40,000 and 100,000, so we truncate down to a multiple of 10,000 (e.g. if we have 62,345 SNPs, we keep only the first 60,000).
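
The truncation and the reference/target column split can be checked on plain integers before touching the VCF. The helpers below are a minimal sketch with hypothetical names; they assume 9 fixed VCF columns followed by one column per sample, with the last 400 sample columns reserved as targets. Note that truncating is not the same as rounding to one significant digit: `round(67_000, sigdigits=1)` gives 70,000, which would exceed the number of available rows.

```julia
# Truncate a SNP count down to the nearest multiple of 10,000.
truncate_to_10k(n::Integer) = (n ÷ 10_000) * 10_000

# Split column indices into fixed VCF columns, reference samples, and target samples.
# n_fixed and n_target are illustrative defaults matching the description above.
function split_columns(ncols::Integer; n_fixed::Integer = 9, n_target::Integer = 400)
    ref_end = ncols - n_target
    return 1:n_fixed, (n_fixed + 1):ref_end, (ref_end + 1):ncols
end

fixed_cols, ref_cols, target_cols = split_columns(10_009)
println(truncate_to_10k(62_345), " SNPs kept; ",
        length(ref_cols), " reference columns; ",
        length(target_cols), " target columns")
```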

In [11]:
using DataFrames  # recent CSV.jl releases require an explicit sink type for CSV.read
@time x = CSV.read("simulation.vcf", DataFrame; header=6)  # row 6 holds the column names
x[!, :ID] .= '.'    # ID, QUAL, and INFO carry no data; VCF marks missing fields with '.', not 0.0
x[!, :QUAL] .= '.'
x[!, :INFO] .= '.'
x[!, :REF] .= 'A'   # set every reference allele to A instead of 0
x[!, :ALT] .= 'T';  # set every alternate allele to T instead of 1

#### Remove SNPs that share the same POS, which should not occur

In [12]:
# indexin returns one row index per distinct POS, so repeated positions are dropped
unique_rows = indexin(unique(x[!, :POS]), x[!, :POS])
x_unique = x[unique_rows, :]
@assert length(unique(x_unique[!, :POS])) == size(x_unique, 1)
x_unique

In [4]:
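ref_end = size(x_unique, 2) - 400                # last column of the reference panel; final 400 columns are targets
num_snp = (size(x_unique, 1) ÷ 10_000) * 10_000  # truncate down to a multiple of 10,000 (rounding could exceed the row count)

first_9_cols = x_unique[1:num_snp, 1:9]          # fixed VCF columns, #CHROM through FORMAT
refhap = [first_9_cols x_unique[1:num_snp, 10:ref_end]]
target = [first_9_cols x_unique[1:num_snp, (ref_end+1):end]]

CSV.write("target.vcf", target, delim='\t')
CSV.write("haplo_ref.vcf", refhap, delim='\t')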

"haplo_ref.vcf"

#### Prepend the following header lines to the top of `target.vcf` and `haplo_ref.vcf`
```
##fileformat=VCFv4.2
##source=vcftools
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
```
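
The header can also be prepended programmatically instead of by hand. The helper below is a minimal sketch (`prepend_header` and `VCF_HEADER` are hypothetical names, not part of `VCFTools.jl`); it streams the original file into a temporary file rather than loading it into memory, since `haplo_ref.vcf` is large.

```julia
# Meta-information lines required at the top of each VCF file.
const VCF_HEADER = """
##fileformat=VCFv4.2
##source=vcftools
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
"""

# Write `header` followed by the existing contents of `path`, then replace `path`.
function prepend_header(path::AbstractString, header::AbstractString)
    tmp = path * ".tmp"
    open(tmp, "w") do out
        write(out, header)
        open(path, "r") do inp
            write(out, inp)   # copy the remaining bytes without parsing them
        end
    end
    mv(tmp, path; force = true)
    return path
end
```

Calling `prepend_header("target.vcf", VCF_HEADER)` and `prepend_header("haplo_ref.vcf", VCF_HEADER)` after the `CSV.write` calls would then produce files that start with the meta-information lines, followed by the `#CHROM` column-name row.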

### Step 3: Import target genotype file and randomly mask entries

+ `X` is the genotype matrix (without missing entries)
+ `Xm` is the masked genotype matrix (imputation target)

In [13]:
@time X = convert_gt(Float32, "target.vcf"; model = :additive)

 12.004489 seconds (194.20 M allocations: 17.230 GiB, 27.62% gc time)


400×80000 Array{Union{Missing, Float32},2}:
 0.0  0.0  1.0  0.0  2.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  2.0  0.0  0.0  0.0     1.0  0.0  0.0  0.0  0.0  0.0  2.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  2.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  2.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  …  1.0  0.0  0.0  0.0  0.0  0.0  2.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  2.0  0.0  0.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  …  1.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮

In [14]:
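n, p = size(X)
Random.seed!(123)   # fix the seed so the mask is reproducible
missingprop = 0.1   # fraction of entries to mask

# mask each entry independently with probability missingprop
Xm = ifelse.(rand(Float32, n, p) .< missingprop, missing, X)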

400×80000 Array{Union{Missing, Float32},2}:
 0.0       0.0       1.0       0.0       …  0.0       0.0       1.0     
 0.0       1.0       0.0       0.0          0.0       0.0       1.0     
 0.0       0.0       0.0       0.0          0.0       0.0       2.0     
 0.0       0.0       0.0       0.0          0.0       0.0       1.0     
  missing   missing   missing  0.0          0.0       0.0       2.0     
 0.0       0.0       0.0        missing  …  0.0       0.0        missing
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           missing  0.0        missing
 0.0       0.0       0.0       0.0          0.0       0.0        missing
 0.0       1.0       0.0       0.0          0.0       0.0       1.0     
 0.0        missing  0.0       0.0       …  0.0       0.0        missing
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0        missing  1.0     
 ⋮     
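
The masking step can be tried on a small matrix to confirm that roughly the requested fraction of entries ends up missing. This is a sketch only: `mask_entries` is a hypothetical helper, and it returns the mask alongside the masked matrix so that the hidden positions can be recovered later.

```julia
using Random

# Mask each entry of X independently with probability `prop`.
# Returns the masked copy and the Boolean mask; X itself is left untouched.
function mask_entries(X::AbstractMatrix, prop::Real; rng = Random.default_rng())
    mask = rand(rng, Float32, size(X)...) .< prop
    Xm = Matrix{Union{Missing, eltype(X)}}(X)   # copy that is allowed to hold `missing`
    Xm[mask] .= missing
    return Xm, mask
end

X_demo = zeros(Float32, 200, 500)
Xm_demo, mask_demo = mask_entries(X_demo, 0.1; rng = MersenneTwister(123))
println("fraction masked = ", count(ismissing, Xm_demo) / length(Xm_demo))
```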

### Step 4: Import (transposed) haplotype reference panels

In `haplo_ref.vcf`:
+ Each column is a person's (phased) genotype
+ Each row is a SNP

`convert_ht` returns one haplotype per row, so we transpose the numeric matrix afterwards because `MendelImpute` currently (11/13/2019) expects SNPs as rows and haplotypes as columns.
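
To make the relationship between haplotypes and genotypes concrete: a phased genotype such as `0|1` contributes one allele to each of a sample's two haplotypes, and the additive genotype is their sum. The toy example below assumes that each sample occupies two consecutive haplotype rows; this row ordering is an assumption for illustration, not something confirmed here about `convert_ht`.

```julia
# Toy phased data: 2 samples (4 haplotypes) by 3 SNPs, one haplotype per row.
H_toy = Float32[0 1 0;
                1 1 0;
                0 0 1;
                0 0 1]

# Assumption: rows 2i-1 and 2i hold the two haplotypes of sample i.
genotype_from_haplotypes(H::AbstractMatrix) = H[1:2:end, :] .+ H[2:2:end, :]

G_toy = genotype_from_haplotypes(H_toy)   # 2×3 matrix of dosages in {0, 1, 2}
```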

In [9]:
@time H = convert_ht(Float32, "haplo_ref.vcf")

263.062229 seconds (4.61 G allocations: 417.358 GiB, 28.70% gc time)


19200×80000 Array{Float32,2}:
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     1.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮

In [10]:
@time H = copy(H')

 12.967581 seconds (126.72 k allocations: 5.728 GiB, 0.04% gc time)


80000×19200 Array{Float32,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0     0.0  1.0  0.0  1.0  0.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮

### Step 5: Test MendelImpute

The genotype data must be transposed as well, because `MendelImpute` currently (11/13/2019) expects SNPs as rows and samples as columns.

In [15]:
X = copy(X')
Xm = copy(Xm')
Xm_original = copy(Xm)

80000×400 Array{Union{Missing, Float32},2}:
 0.0  0.0       0.0       0.0       …  0.0       0.0  0.0       0.0     
 0.0  1.0       0.0       0.0          0.0       0.0  1.0       1.0     
 1.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 2.0   missing  2.0       0.0           missing  2.0  1.0       1.0     
 0.0   missing  0.0        missing  …  0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0        missing
 0.0  0.0       0.0       0.0           missing  0.0  0.0       0.0     
 0.0   missing  0.0       0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       1.0          1.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0       …  0.0       0.0  0.0       0.0     
 0.0  0.0        missing  0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 ⋮     
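
The transposes in Steps 4 and 5 are written as `copy(A')` rather than `A'` alone. In Julia, `A'` is a lazy wrapper that shares memory with `A`, while `copy(A')` allocates a new dense matrix with the transposed layout. A small illustration on toy data:

```julia
A = Float32[0 1 0;
            1 0 0]          # 2 samples × 3 SNPs

A_lazy  = A'                # lazy Adjoint wrapper, no data copied
A_dense = copy(A')          # dense 3×2 Matrix{Float32}, independent of A

A[1, 1] = 2                 # modify the original
println(A_lazy[1, 1], " ", A_dense[1, 1])   # lazy view sees the change, dense copy does not
```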

In [16]:
Xm

80000×400 Array{Union{Missing, Float32},2}:
 0.0  0.0       0.0       0.0       …  0.0       0.0  0.0       0.0     
 0.0  1.0       0.0       0.0          0.0       0.0  1.0       1.0     
 1.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 2.0   missing  2.0       0.0           missing  2.0  1.0       1.0     
 0.0   missing  0.0        missing  …  0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0        missing
 0.0  0.0       0.0       0.0           missing  0.0  0.0       0.0     
 0.0   missing  0.0       0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       1.0          1.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0       …  0.0       0.0  0.0       0.0     
 0.0  0.0        missing  0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 ⋮     

In [17]:
H

80000×19200 Array{Float32,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0     0.0  1.0  0.0  1.0  0.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮

In [30]:
@time hapset, ph = phase2(Xm, H, width=3000); 

 30.927562 seconds (113.62 k allocations: 260.615 MiB, 13.40% gc time)


### Step 6: Calculate error

#### 2000 haplotypes  
+ 

#### 20,000 haplotypes (19,200 in the reference panel)
+ width 50: error = 0.020779933032235538, time = 52.474786 sec
+ width 400: error = 0.003887765749620765, time = 22.658061 sec
+ width 800: error = 0.003113523197137645, time = 25.785471 sec
+ width 1200: error = 0.0029020822096301534, time = 25.103432 sec
+ width 2000: error = 0.0027278073632059576, time = 27.418955 sec
+ width 3000: error = 0.003086038991966804, time = 30.927562 sec
+ width 5000: error = 0.004203834109085435, time = 37.827126 sec

In [31]:
impute2!(Xm, H, ph)
missing_idx    = ismissing.(Xm_original)  # positions that were masked
total_missing  = sum(missing_idx)
actual_missing_values  = convert(Vector{Int64}, X[missing_idx])  # true values of the masked entries
imputed_missing_values = convert(Vector{Int64}, Xm[missing_idx]) # imputed values of the masked entries
error_rate = sum(actual_missing_values .!= imputed_missing_values) / total_missing
@show error_rate
copyto!(Xm, Xm_original);  # restore the masked matrix so another width can be tested

error_rate = 0.003086038991966804
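
The error rate reported above is the fraction of masked entries whose imputed value differs from the true value. The same computation can be wrapped in a small function and checked on toy matrices; `imputation_error` is a hypothetical helper name, not a `MendelImpute` function.

```julia
# Fraction of masked entries that were imputed incorrectly.
# X_true:    complete genotype matrix
# X_imputed: matrix after imputation (no missing entries)
# X_masked:  matrix given to the imputation routine (contains `missing`)
function imputation_error(X_true::AbstractMatrix, X_imputed::AbstractMatrix,
                          X_masked::AbstractMatrix)
    idx = ismissing.(X_masked)
    n_missing = count(idx)
    n_missing == 0 && return 0.0          # nothing was masked
    return count(X_true[idx] .!= X_imputed[idx]) / n_missing
end

X_true    = Float32[0 1; 2 0]
X_masked  = Union{Missing, Float32}[0 missing; missing 0]
X_imputed = Float32[0 1; 1 0]             # one of the two masked entries is wrong
println(imputation_error(X_true, X_imputed, X_masked))
```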
