# Example Data Generation

This notebook documents how to simulate realistic haplotypes using [msprime](https://msprime.readthedocs.io/en/stable/#), then how to process the result into target genotype data and reference haplotype panels using our [VCFTools.jl](https://github.com/OpenMendel/VCFTools.jl) package.

**Note:** For demonstration purposes, we simulated an *extremely small* reference panel. 

In [1]:
# install Julia packages needed
using Pkg
Pkg.add(PackageSpec(url="https://github.com/OpenMendel/MendelImpute.jl.git"))
Pkg.add(PackageSpec(url="https://github.com/OpenMendel/VCFTools.jl.git"))
Pkg.add("Random")
Pkg.add("UnicodePlots")

# load necessary packages in Julia
using MendelImpute
using VCFTools
using Random
using UnicodePlots

┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1278


## Step 0. Install `msprime`

Follow the [msprime installation instructions](https://msprime.readthedocs.io/en/stable/installation.html).

Some users may need to enable automatic activation of the conda base environment via `conda config --set auto_activate_base True`. You can turn it off once the simulation is done by running `conda config --set auto_activate_base False`.


## Step 1. Simulate phased haplotypes 

The following command was executed in the terminal in the data folder:

```
python3 msprime_script.py 5000 10000 5000000 2e-8 2e-8 2020 > full.vcf
```

Argument meaning: 
+ Number of haplotypes = 5000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 5 million
+ Recombination rate = 2e-8 (default)
+ Mutation rate = 2e-8 (default)
+ seed = 2020

The resulting `full.vcf` is a VCF file containing 2500 samples with phased genotypes at 36063 SNPs.
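As a rough sanity check on the number of SNPs, the sketch below evaluates Watterson's formula for the expected number of segregating sites, $E[S] = \theta \sum_{i=1}^{n-1} 1/i$ with $\theta = 4 N_e \mu L$. It assumes the standard neutral coalescent with an infinite-sites mutation model and that the script arguments map onto these symbols as listed above; the actual mutation model used by `msprime_script.py` is not shown here, so treat the result only as an order-of-magnitude guide.

```julia
# Expected number of segregating sites under the standard neutral coalescent:
# E[S] = theta * sum_{i=1}^{n-1} 1/i, with theta = 4 * Ne * mu * L.
# Assumption: the script arguments correspond to these quantities.
n_haplotypes = 5000      # sampled haplotypes
Ne = 10_000              # effective population size (diploid)
seq_length = 5_000_000   # simulated sequence length
mu = 2e-8                # mutation rate per site per generation

theta = 4 * Ne * mu * seq_length
harmonic = sum(1 / i for i in 1:(n_haplotypes - 1))
expected_sites = theta * harmonic

println("theta = ", theta)
println("expected segregating sites ≈ ", round(Int, expected_sites))
```

Under these assumptions the expectation comes out on the order of 36,000, which is in line with the 36063 SNPs reported for `full.vcf`.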

In [6]:
data = joinpath(normpath(MendelImpute.datadir()), "full.vcf") # path to the simulated VCF file
nsamples(data), nrecords(data)

(2500, 36063)

## Step 2: Convert simulated data to reference and target files

Starting with the simulated data `full.vcf`, we use 100 samples as imputation targets and the remaining samples as the reference panel. Filtering is done with utilities in [VCFTools.jl](https://github.com/OpenMendel/VCFTools.jl). We randomly choose 5,000 SNPs with minor allele frequency $> 0.05$ as the typed positions. Note that the data must conform to [MendelImpute's data preparation requirements](https://openmendel.github.io/MendelImpute.jl/dev/man/Phasing+and+Imputation/#Preparing-Target-Data).
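To make the selection rule concrete, here is a minimal sketch on a toy genotype matrix. It only illustrates how a minor allele frequency is computed and thresholded and how a Boolean SNP mask is built; the matrix `G` and the value `p = 3` are made up for illustration, and the actual notebook obtains `maf_by_record` from `gtstats` instead.

```julia
using Random

Random.seed!(2020)

# Toy genotype matrix (hypothetical data): rows are SNPs, columns are samples,
# entries count alternate alleles (0, 1, or 2). Rows 3 and 6 are monomorphic.
G = [mod(i * j, 3) for i in 1:8, j in 1:20]

# Alternate allele frequency per SNP, folded to the minor allele frequency
alt_freq = vec(sum(G, dims = 2)) ./ (2 * size(G, 2))
maf = min.(alt_freq, 1 .- alt_freq)

# Candidate typed SNPs: MAF strictly above the threshold
candidates = findall(>(0.05), maf)

# Randomly keep p of the candidates as a Boolean mask over all SNPs
p = 3
typed = falses(size(G, 1))
typed[shuffle(candidates)[1:p]] .= true

println("candidates: ", candidates)
println("number of typed SNPs: ", count(typed))
```

The mask `typed` plays the same role as `record_idx` in the cell below: it has one entry per SNP and is `true` only at the chosen typed positions.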

In [9]:
# change into the package's data directory
cd(normpath(MendelImpute.datadir()))

# set random seed for reproducibility
Random.seed!(2020)

# remove SNPs with the same positions, keep all samples, save result into new file
SNPs_to_keep = .!find_duplicate_marker(data) 
VCFTools.filter(data, SNPs_to_keep, 1:nsamples(data), des = "uniqueSNPs.vcf.gz")

# summarize data
total_snps, samples, _, _, _, maf_by_record, _ = gtstats("uniqueSNPs.vcf.gz")

# generate target file with 100 samples and 5k snps with maf>0.05
n = 100
p = 5000
record_idx = falses(total_snps)
large_maf = findall(x -> x > 0.05, maf_by_record)  
Random.shuffle!(large_maf)
record_idx[large_maf[1:p]] .= true
sample_idx = falses(samples)
sample_idx[1:n] .= true
Random.shuffle!(sample_idx)
VCFTools.filter("uniqueSNPs.vcf.gz", record_idx, sample_idx, 
    des = "target.typedOnly.vcf.gz", allow_multiallelic=false)

# unphase the target file and mask 1% of its entries
masks = falses(p, n)
missingprop = 0.01
for j in 1:n, i in 1:p
    rand() < missingprop && (masks[i, j] = true)
end
mask_gt("target.typedOnly.vcf.gz", masks, 
    des="target.typedOnly.masked.vcf.gz", unphase=true)

# generate target panel with all snps (containing true phase and genotypes)
VCFTools.filter("uniqueSNPs.vcf.gz", 1:total_snps, 
    sample_idx, des = "target.full.vcf.gz", allow_multiallelic=false)

# generate reference panel
VCFTools.filter("uniqueSNPs.vcf.gz", 1:total_snps, .!sample_idx, 
    des = "ref.excludeTarget.vcf.gz", allow_multiallelic=false)

finding duplicate markers...100%|███████████████████████| Time: 0:00:20
filtering vcf file...100%|██████████████████████████████| Time: 0:00:29
Progress: 100%|█████████████████████████████████████████| Time: 0:00:26
filtering vcf file...100%|██████████████████████████████| Time: 0:00:24
filtering vcf file...100%|██████████████████████████████| Time: 0:00:26
filtering vcf file...100%|██████████████████████████████| Time: 0:00:41


## Step 3: Generate `.jlso` compressed reference panel

MendelImpute requires one to pre-process the reference panel for faster reading. This is achieved via the [compress_haplotypes](https://openmendel.github.io/MendelImpute.jl/dev/man/api/#MendelImpute.compress_haplotypes) function.

In [11]:
reffile = "ref.excludeTarget.vcf.gz"
tgtfile = "target.typedOnly.masked.vcf.gz"
outfile = "ref.excludeTarget.jlso"
@time compress_haplotypes(reffile, tgtfile, outfile)

importing reference data...100%|████████████████████████| Time: 0:00:10


 19.730624 seconds (185.85 M allocations: 13.691 GiB, 5.74% gc time)


## Output explanation:

You just generated the reference and target files used for imputation:

+ `ref.excludeTarget.jlso`: Compressed reference haplotype panel with 4800 haplotypes (in JLSO format)
+ `target.typedOnly.masked.vcf.gz`: Imputation target file containing 100 samples at 5k SNPs. All genotypes are unphased and contain 1% missing data. 

You also generated/downloaded:

+ `full.vcf`: The original simulated data from `msprime`.
+ `uniqueSNPs.vcf.gz`: The original data with duplicate records (SNPs sharing the same position) removed. 
+ `ref.excludeTarget.vcf.gz`: Reference haplotype panel (in VCF format)
+ `target.full.vcf.gz`: The complete data for imputation target, used for checking imputation accuracy. All genotypes are phased and non-missing. 
+ `target.typedOnly.vcf.gz`: Complete target data on just the typed SNPs. All genotypes are phased and non-missing. This file is only a by-product of generating the other files and is not used downstream.
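The sample and haplotype counts quoted above follow directly from the split in Step 2. The sketch below reproduces only that bookkeeping with the Julia standard library (no VCF files are read); the variable names `total_samples`, `n_target`, `n_reference`, and `n_ref_haplotypes` are introduced here for illustration.

```julia
using Random

Random.seed!(2020)

total_samples = 2500   # diploid samples in full.vcf
n_target = 100         # samples held out as imputation targets

# Same construction as in Step 2: mark n_target entries, then shuffle the mask
sample_idx = falses(total_samples)
sample_idx[1:n_target] .= true
shuffle!(sample_idx)

n_reference = count(.!sample_idx)
n_ref_haplotypes = 2 * n_reference   # each diploid sample carries two haplotypes

println((count(sample_idx), n_reference, n_ref_haplotypes))   # prints (100, 2400, 4800)
```

Because the reference panel is selected with `.!sample_idx`, target and reference samples are disjoint by construction.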

## Statistics on compressed reference panel

`MendelImpute` contains some unexported utility functions to quickly summarize a `.jlso` compressed haplotype reference panel. For instance,

In [12]:
# calculate number of unique haplotypes per window
haps_per_window = MendelImpute.count_haplotypes_per_window("ref.excludeTarget.jlso")

# calculate window width
window_width = MendelImpute.get_window_widths("ref.excludeTarget.jlso");

In [13]:
histogram(haps_per_window)

                  ┌                                        ┐
   [600.0, 650.0) ┤▇▇▇▇▇▇ 1
   [650.0, 700.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6
   [700.0, 750.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6
   [750.0, 800.0) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3
                  └                                        ┘
                                  Frequency

In [14]:
histogram(window_width)

                  ┌                                        ┐
   [312.0, 312.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8
   [312.2, 312.4) ┤ 0
   [312.4, 312.6) ┤ 0
   [312.6, 312.8) ┤ 0
   [312.8, 313.0) ┤ 0
   [313.0, 313.2) ┤▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8
                  └                                        ┘
                                  Frequency

**Conclusion:** The compressed reference panel contains 16 windows of 312 or 313 typed SNPs each, and each window holds approximately 600 to 800 unique haplotypes.
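To illustrate what "unique haplotypes per window" means, the following sketch counts distinct haplotype segments in contiguous SNP windows of a small toy panel. The helper `toy_unique_haplotypes_per_window` is hypothetical and is not how MendelImpute stores or summarizes a `.jlso` panel; in particular, the even split into windows is an assumption made only for this example.

```julia
# Hypothetical helper, not part of MendelImpute: count distinct haplotypes in
# each of `nwindows` contiguous blocks of SNPs. H has SNPs as rows and
# haplotypes as columns.
function toy_unique_haplotypes_per_window(H::AbstractMatrix, nwindows::Int)
    nsnps = size(H, 1)
    # Window boundaries that split nsnps as evenly as possible
    edges = [round(Int, k * nsnps / nwindows) for k in 0:nwindows]
    counts = Int[]
    widths = Int[]
    for w in 1:nwindows
        rows = (edges[w] + 1):edges[w + 1]
        # Each column restricted to this window is one haplotype segment
        segments = Set(H[rows, j] for j in 1:size(H, 2))
        push!(counts, length(segments))
        push!(widths, length(rows))
    end
    return counts, widths
end

# Toy panel: 6 SNPs, 4 haplotypes; haplotypes 1 and 2 agree on the first 3 SNPs
H = Bool[0 0 1 1;
         1 1 0 1;
         0 0 0 0;
         1 0 1 1;
         0 1 1 0;
         1 1 0 0]
counts, widths = toy_unique_haplotypes_per_window(H, 2)
println("unique haplotypes per window: ", counts)
println("window widths: ", widths)
```

In the toy panel the first window has 3 distinct segments (two haplotypes coincide) and the second has 4. The same arithmetic explains the widths seen above: 5000 typed SNPs divided into 16 windows gives 312.5 per window, so widths of 312 and 313 are expected.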

# Dosage data

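Since our software supports dosages as inputs (i.e. genotypes are real numbers in $[0, 2]$), let us also generate dosage data. For simplicity, we perturb the alternate allele count by a small amount $\epsilon = 0.01 \cdot U(0, 1)$, rounded to 3 decimal places: $\epsilon$ is added to counts of 0 and 1 and subtracted from counts of 2, so every dosage stays within $[0, 2]$.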

In [23]:
function write_snp!(io, X::AbstractMatrix, i::Int)
    x = @view(X[i, :]) # current record
    n = length(x)
    for j in 1:n
        y = round(0.01rand(), digits=3)
        if ismissing(x[j])
            print(io, "\t./.:.")
        elseif x[j] == 0
            print(io, "\t0/0:", y)
        elseif x[j] == 1
            print(io, "\t1/0:", 1 + y)
        elseif x[j] == 2
            print(io, "\t1/1:", 2 - y)
        else
            error("imputed genotypes can only be 0, 1, 2 but got $(x[j])")
        end
    end
    print(io, "\n")
    nothing
end

write_snp! (generic function with 1 method)

In [24]:
# import hard genotypes
X, X_sampleID, chr, pos, ids, ref, alt = 
    VCFTools.convert_gt(Float64, "target.typedOnly.masked.vcf.gz", trans=true, 
    save_snp_info=true, msg = "Importing genotype file...")
outfile = "target.typedOnly.dosages.masked.vcf.gz"

# generate VCF file with dosage data
io = openvcf(outfile, "w")
print(io, "##fileformat=VCFv4.2\n")
print(io, "##source=MendelImpute\n")
print(io, "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n")
print(io, "##FORMAT=<ID=DS,Number=1,Type=Float,Description=\"Dosages\">\n")
print(io, "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT")
for id in X_sampleID
    print(io, "\t", id)
end
print(io, "\n")
for i in 1:size(X, 1)
    print(io, chr[i], "\t", string(pos[i]), "\t", ids[i][1], "\t", 
        ref[i], "\t", alt[i][1], "\t.\tPASS\t.\t")
    print(io, "GT:DS")
    write_snp!(io, X, i)
end
close(io);

In [25]:
DS = convert_ds(Float64, "target.typedOnly.dosages.masked.vcf.gz")

100×5000 Array{Union{Missing, Float64},2}:
 0.002  1.002  1.003  0.01   0.001     …  1.009  0.006  0.01      1.998
 0.009  0.002  0.001  1.99   1.993        1.004  0.004  0.004     2.0
 0.001  1.995  2.0    0.005  0.006        1.997  0.001  0.005     1.996
 0.004  1.994  1.995  0.005  0.009        0.01   0.004  0.005     1.994
 1.004  1.003  1.006  0.008  0.008        1.991  0.002  0.003     2.0
 0.006  1.009  1.002  1.002   missing  …  1.002  0.004  0.006     1.994
 0.008  1.998  2.0    0.001  0.002        1.996  1.005  0.005     1.998
 1.003  1.998  1.994  0.002  0.0          1.004  0.0    0.001     1.998
 0.004  1.991  1.992  0.009  0.01         1.001  0.003  0.004     1.996
 0.003  2.0    1.996  0.009   missing     1.007  0.006  1.001     1.0
 0.008  1.998  1.994  0.001  0.003     …  0.008  0.007   missing  1.995
 0.005  1.999  1.998  0.004  0.004        0.005  0.004  1.0       1.009
 0.007  1.002  1.007  0.005  0.008        1.999  0.004  0.007     1.992
 ⋮                         
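
One simple way to check dosage data generated this way is to confirm that rounding each dosage recovers the original hard genotype call and that missing entries stay missing. The sketch below does this on a toy vector rather than on the VCF file; `toy_dosage` is a hypothetical helper that mirrors the perturbation rule in `write_snp!` above.

```julia
using Random

Random.seed!(2020)

# Hypothetical helper mirroring the perturbation rule in write_snp!:
# shift 0 and 1 upward and 2 downward so every dosage stays inside [0, 2].
function toy_dosage(g::Union{Missing, Real}, noise::Real)
    ismissing(g) && return missing
    return g == 2 ? 2 - noise : g + noise
end

hard = Union{Missing, Float64}[0, 1, 2, missing, 2, 0]
noise = [round(0.01 * rand(), digits = 3) for _ in hard]
dosages = toy_dosage.(hard, noise)

# Rounding a dosage to the nearest integer recovers the hard genotype call
recovered = [ismissing(d) ? missing : round(d) for d in dosages]

println(isequal(recovered, hard))
```

The same comparison could be applied to the matrix `DS` and the hard genotypes, keeping in mind that the two matrices may be stored in transposed orientations.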