# Generating simulated haplotype files

This notebook describes how the genotype and haplotype reference files are generated.

In [50]:
using CSV
using VCFTools
using Random
using DelimitedFiles
using MendelImpute
using Plots
using SparseArrays

### Step 0. Install `msprime`

[msprime download Link](https://msprime.readthedocs.io/en/stable/installation.html).

Some people might need to activate conda environment via `conda config --set auto_activate_base True`. You can turn it off once simulation is done by executing `conda config --set auto_activate_base False`.


### Step 1. Simulate data in terminal

```
python3 msprime_script.py 20000 10000 10000000 2e-8 2e-8 2019 > simulation.vcf
```

Arguments: 
+ Number of haplotypes = 20000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 10 million (same as Beagle 5's choice)
+ Rrecombination rate = 2e-8 (default)
+ mutation rate = 2e-8 (default)
+ seed = 2019

### Step 2: Convert simulated haplotypes to reference haplotypes and target genotype files (in .vcf format)

In the future, this should be done by `VCFTools.jl`. Currently we are brute forcing our way through.

The last 200 people (400 haplotypes) are treated as imputation targets, where 10% of entries are masked randomly. The remaining haplotypes become reference panels. The number of markers fluctuates between 40,000 and 100,000, and we truncate at the 10-thousand range (e.g. if we have 62,345 snps, we only keep first 60,000). 

In [2]:
@time x = CSV.read("simulation.vcf", header=6)
x[!, :ID] .= '.'  # these columns have missing data, which should be represented as . not 0.0
x[!, :QUAL] .= '.'
x[!, :INFO] .= '.'
x[!, :REF] .= 'A' #set all reference allele to A instead of 0
x[!, :ALT] .= 'T'; #set all alternate allele to T instead of 1

171.382279 seconds (18.32 M allocations: 899.590 MiB, 0.26% gc time)


#### Filter out SNPs with same POS, which shouldn't exist

In [3]:
unique_rows = indexin(unique(x[!, :POS]), x[!, :POS])
x_unique = x[unique_rows, :]
@assert length(unique(x_unique[!, :POS])) == size(x_unique, 1)
x_unique

Unnamed: 0_level_0,#CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,tsk_0,tsk_1,tsk_2
Unnamed: 0_level_1,Int64,Int64,Char,Char,Char,Char,String,Char,String,String,String,String
1,1,41,'.','A','T','.',PASS,'.',GT,0|0,0|0,0|0
2,1,153,'.','A','T','.',PASS,'.',GT,1|1,1|1,1|1
3,1,203,'.','A','T','.',PASS,'.',GT,0|0,0|0,0|0
4,1,244,'.','A','T','.',PASS,'.',GT,0|0,0|0,0|0
5,1,475,'.','A','T','.',PASS,'.',GT,1|1,0|1,1|1
6,1,500,'.','A','T','.',PASS,'.',GT,0|0,0|0,0|0
7,1,570,'.','A','T','.',PASS,'.',GT,0|0,0|0,0|0
8,1,593,'.','A','T','.',PASS,'.',GT,0|0,0|0,0|0
9,1,633,'.','A','T','.',PASS,'.',GT,0|0,0|0,0|0
10,1,833,'.','A','T','.',PASS,'.',GT,0|0,0|0,0|0


In [4]:
ref_end = size(x_unique, 2) - 400
num_snp = Int(round(size(x_unique, 1), sigdigits = 1))

first_9_cols = x_unique[1:num_snp, 1:9]
refhap = [first_9_cols x_unique[1:num_snp, 10:ref_end]]
target = [first_9_cols x_unique[1:num_snp, (ref_end+1):end]]

CSV.write("target.vcf", target, delim='\t')
CSV.write("haplo_ref.vcf", refhap, delim='\t')

"haplo_ref.vcf"

#### Add the following to the header of `target.vcf` and `haplo_ref.vcf`
```
##fileformat=VCFv4.2
##source=vcftools
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
```

### Step 3: Import target genotype file and randomly mask entries

+ `X` is the genotype matrix (without missing entries)
+ `Xm` is the masked genotype matrix (imputation target)

In [5]:
@time X = convert_gt(Float32, "target.vcf"; model = :additive)

 27.660199 seconds (199.83 M allocations: 17.505 GiB, 65.88% gc time)


400×80000 Array{Union{Missing, Float32},2}:
 0.0  0.0  1.0  0.0  2.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  2.0  0.0  0.0  0.0     1.0  0.0  0.0  0.0  0.0  0.0  2.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  2.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  2.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  …  1.0  0.0  0.0  0.0  0.0  0.0  2.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  2.0  0.0  0.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  …  1.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0 

In [6]:
n, p = size(X)
Random.seed!(123)
missingprop = 0.1

Xm = copy(X)
Xm = ifelse.(rand(Float32, n, p) .< missingprop, missing, Xm)

400×80000 Array{Union{Missing, Float32},2}:
 0.0       0.0       1.0       0.0       …  0.0       0.0       1.0     
 0.0       1.0       0.0       0.0          0.0       0.0       1.0     
 0.0       0.0       0.0       0.0          0.0       0.0       2.0     
 0.0       0.0       0.0       0.0          0.0       0.0       1.0     
  missing   missing   missing  0.0          0.0       0.0       2.0     
 0.0       0.0       0.0        missing  …  0.0       0.0        missing
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           missing  0.0        missing
 0.0       0.0       0.0       0.0          0.0       0.0        missing
 0.0       1.0       0.0       0.0          0.0       0.0       1.0     
 0.0        missing  0.0       0.0       …  0.0       0.0        missing
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0        missing  1.0     
 ⋮     

### Step 4: Import (transposed) haplotype reference panels

In the future, this should be done by `VCFTools.jl`. Currently we are brute forcing our way through.

In `haplo_ref.vcf`:
+ Each column is a person's (phased) genotype
+ Each row is a SNP

We will transpose this data as we convert it to a numeric matrix because `MendelImpute` currently (11/13/2019) requires transposed version.

In [26]:
@time haplo_ref = CSV.read("haplo_ref.vcf", header=6)[:, 10:end]

236.333037 seconds (469.34 k allocations: 2.889 GiB, 40.03% gc time)


Unnamed: 0_level_0,tsk_0,tsk_1,tsk_2,tsk_3,tsk_4,tsk_5,tsk_6,tsk_7,tsk_8,tsk_9
Unnamed: 0_level_1,String,String,String,String,String,String,String,String,String,String
1,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0
2,1|1,1|1,1|1,1|0,0|0,1|1,1|0,1|0,1|1,0|1
3,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0
4,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0
5,1|1,0|1,1|1,1|0,0|0,1|0,0|0,0|0,1|1,0|1
6,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|1,0|0,0|0
7,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0
8,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0
9,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0,0|0
10,0|0,0|0,0|0,0|0,0|0,0|0,0|0,1|0,0|0,0|0


In [27]:
function parse_phased_data(haplo_ref)
    T = Float32
    p, d = size(haplo_ref)
    H = zeros(T, p, 2d)
    split_holder = Vector{String}(undef, 2)
    
    for j in 1:d, i in 1:p
        split_holder .= split(haplo_ref[i, j], '|')
        H[i, 2j - 1] = (split_holder[1] == "0" ? zero(T) : one(T))
        H[i, 2j]     = (split_holder[2] == "0" ? zero(T) : one(T))
    end
    
    return H
end

parse_phased_data (generic function with 1 method)

In [28]:
@time H = parse_phased_data(haplo_ref)

600.060249 seconds (8.44 G allocations: 245.977 GiB, 19.04% gc time)


80000×19200 Array{Float32,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0     1.0  1.0  1.0  1.0  0.0  1.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  1.0  1.0  1.0  1.0  0.0     1.0  1.0  0.0  1.0  0.0  1.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0

### Step 5: Test MendelImpute

First need to transpose genotype data as well, because `MendelImpute` currently (11/13/2019) requires transposed version.

In [31]:
X = copy(X')
Xm = copy(Xm')
Xm_original = copy(Xm)

80000×400 Array{Union{Missing, Float32},2}:
 0.0  0.0       0.0       0.0       …  0.0       0.0  0.0       0.0     
 0.0  1.0       0.0       0.0          0.0       0.0  1.0       1.0     
 1.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 2.0   missing  2.0       0.0           missing  2.0  1.0       1.0     
 0.0   missing  0.0        missing  …  0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0        missing
 0.0  0.0       0.0       0.0           missing  0.0  0.0       0.0     
 0.0   missing  0.0       0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       1.0          1.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0       …  0.0       0.0  0.0       0.0     
 0.0  0.0        missing  0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 ⋮     

In [35]:
Xm

80000×400 Array{Union{Missing, Float32},2}:
 0.0  0.0       0.0       0.0       …  0.0       0.0  0.0       0.0     
 0.0  1.0       0.0       0.0          0.0       0.0  1.0       1.0     
 1.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 2.0   missing  2.0       0.0           missing  2.0  1.0       1.0     
 0.0   missing  0.0        missing  …  0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0        missing
 0.0  0.0       0.0       0.0           missing  0.0  0.0       0.0     
 0.0   missing  0.0       0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       1.0          1.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0       …  0.0       0.0  0.0       0.0     
 0.0  0.0        missing  0.0          0.0       0.0  0.0       0.0     
 0.0  0.0       0.0       0.0          0.0       0.0  0.0       0.0     
 ⋮     

In [36]:
H

80000×19200 Array{Float32,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0     1.0  1.0  1.0  1.0  0.0  1.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  1.0  1.0  1.0  1.0  0.0     1.0  1.0  0.0  1.0  0.0  1.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0

In [46]:
@time hapset, ph = phase2(Xm, H, width=5000);

 80.769230 seconds (7.31 M allocations: 462.760 MiB, 34.36% gc time)


### Step 6: Calculate error

#### 2000 haplotypes error = 0.08744580304802659

In [91]:
impute2!(Xm, H, phase)
missing_idx    = ismissing.(Xm_original)
total_missing  = sum(missing_idx)
actual_missing_values  = convert(Vector{Int64}, X[missing_idx])  #true values of missing entries
imputed_missing_values = convert(Vector{Int64}, Xm[missing_idx]) #imputed values of missing entries
error_rate = sum(actual_missing_values .!= imputed_missing_values) / total_missing
@show error_rate
copyto!(Xm, Xm_original);

error_rate = 0.08744580304802659


#### 20,000 haplotypes error = 0.05610338308703217 (width = 5000)

In [47]:
impute2!(Xm, H, ph)
missing_idx    = ismissing.(Xm_original)
total_missing  = sum(missing_idx)
actual_missing_values  = convert(Vector{Int64}, X[missing_idx])  #true values of missing entries
imputed_missing_values = convert(Vector{Int64}, Xm[missing_idx]) #imputed values of missing entries
error_rate = sum(actual_missing_values .!= imputed_missing_values) / total_missing
@show error_rate
copyto!(Xm, Xm_original);

error_rate = 0.05610338308703217


In [51]:
H_sp = sparse(H)

80000×19200 SparseMatrixCSC{Float32,Int64} with 148564334 stored entries:
  [2    ,     1]  =  1.0
  [5    ,     1]  =  1.0
  [14   ,     1]  =  1.0
  [19   ,     1]  =  1.0
  [26   ,     1]  =  1.0
  [33   ,     1]  =  1.0
  [36   ,     1]  =  1.0
  [38   ,     1]  =  1.0
  [44   ,     1]  =  1.0
  [48   ,     1]  =  1.0
  [52   ,     1]  =  1.0
  [54   ,     1]  =  1.0
  ⋮
  [79927, 19200]  =  1.0
  [79928, 19200]  =  1.0
  [79929, 19200]  =  1.0
  [79936, 19200]  =  1.0
  [79937, 19200]  =  1.0
  [79940, 19200]  =  1.0
  [79944, 19200]  =  1.0
  [79945, 19200]  =  1.0
  [79960, 19200]  =  1.0
  [79963, 19200]  =  1.0
  [79988, 19200]  =  1.0
  [79989, 19200]  =  1.0
  [80000, 19200]  =  1.0

In [53]:
using UnicodePlots
spy(H_sp)



InterruptException: InterruptException: