# Convert GT to numeric arrays

Most often we need to convert genetic data to numeric arrays (matrix of **minor allele counts**) for statistical analysis.

## Example VCF file

We need an example VCF file for demonstration. You can manually download it from [link](http://faculty.washington.edu/browning/beagle/test.08Jun17.d8b.vcf.gz) (877KB) and put the file in your current working directory. Or, within Julia,

In [1]:
isfile("test.08Jun17.d8b.vcf.gz") || download("http://faculty.washington.edu/browning/beagle/test.08Jun17.d8b.vcf.gz", 
    joinpath(pwd(), "test.08Jun17.d8b.vcf.gz"))
stat("test.08Jun17.d8b.vcf.gz")

StatStruct(mode=0o100644, size=876514)

The first 35 lines of this VCF file are

In [2]:
using VCFTools

fh = openvcf("test.08Jun17.d8b.vcf.gz", "r")
for l in 1:35
    println(readline(fh))
end
close(fh)

##fileformat=VCFv4.1
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Seque

We'd like to convert the genotype data (GT field) into a matrix of minor allele counts.

## Genetic model

There are different SNP models. The *additive* SNP model counts the number of **minor alleles** (0, 1 or 2) per genotype. The other SNP models are *dominant* and *recessive*, both defined in terms of the **minor allele**. `VCFTools.jl` lets users optionally check, on the fly, whether the `ALT` allele is the minor allele. When the `ALT` allele is the minor allele, genotypes are translated to real numbers according to

| Genotype | VCF GT | `model=:additive` | `model=:dominant` | `model=:recessive` |
|:---:|:---:|:---:|:---:|:---:|
| ALT,ALT | 1/1, 1&#124;1 | 2 | 1 | 1 |
| REF,ALT | 0/1, 0&#124;1, 1/0, 1&#124;0 | 1 | 1 | 0 |
| REF,REF | 0/0, 0&#124;0 | 0 | 0 | 0 |
| missing | . | `missing` | `missing` | `missing` |

When the `REF` allele is the minor allele, genotypes are translated according to

| Genotype | VCF GT | `model=:additive` | `model=:dominant` | `model=:recessive` |
|:---:|:---:|:---:|:---:|:---:|
| ALT,ALT | 1/1, 1&#124;1 | 0 | 0 | 0 |
| REF,ALT | 0/1, 0&#124;1, 1/0, 1&#124;0 | 1 | 1 | 0 |
| REF,REF | 0/0, 0&#124;0 | 2 | 1 | 1 |
| missing | . | `missing` | `missing` | `missing` |
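The translation rules above can be sketched in a few lines of Julia. This is only an illustration, assuming the standard VCF coding in which allele `0` is `REF` and allele `1` is `ALT`; the function `encode_genotype` and its keyword `alt_is_minor` are hypothetical names, not part of `VCFTools.jl`.

```julia
# Hypothetical helper (not part of VCFTools.jl): translate one diploid genotype,
# given as two allele codes (0 = REF, 1 = ALT, or `missing`), into a number.
function encode_genotype(a1, a2; model::Symbol = :additive, alt_is_minor::Bool = true)
    # a missing allele makes the whole genotype missing
    (ismissing(a1) || ismissing(a2)) && return missing
    # count copies of the ALT allele, then of the minor allele
    n_alt = a1 + a2
    n_minor = alt_is_minor ? n_alt : 2 - n_alt
    if model == :additive
        return n_minor               # 0, 1 or 2 copies
    elseif model == :dominant
        return Int(n_minor >= 1)     # at least one copy
    elseif model == :recessive
        return Int(n_minor == 2)     # two copies
    else
        throw(ArgumentError("unknown model: $model"))
    end
end

encode_genotype(0, 1)                                            # REF,ALT, additive
encode_genotype(1, 1; model = :recessive)                        # ALT,ALT, ALT is minor
encode_genotype(0, 0; model = :dominant, alt_is_minor = false)   # REF,REF, REF is minor
encode_genotype(missing, 1)                                      # missing genotype
```

Phased (`0|1`) and unphased (`0/1`) genotypes are treated the same here, because only the number of minor alleles matters.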

To properly record the missing genotypes, VCFTools converts VCF GT data to a matrix `A` whose elements are either real numbers or missing values. In Julia, this means `eltype(A) <: Union{Missing, Real}`, where `<:` means "is a subtype of".
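To see what this element type means in practice, here is a small sketch on a hand-made matrix (hypothetical data, not read from the example file). It uses only Base Julia and the `Statistics` standard library.

```julia
using Statistics

# Hand-made example: 3 samples × 2 markers, with one missing genotype
A = convert(Matrix{Union{Missing, Int8}}, [0 2; 1 missing; 2 1])

eltype(A) <: Union{Missing, Real}    # the subtype relation described above
count(ismissing, A)                  # number of missing genotypes

# mean dosage of each marker, skipping the missing entries
marker_means = [mean(skipmissing(col)) for col in eachcol(A)]
```

Most reductions return `missing` as soon as one entry is missing, so wrapping a column in `skipmissing` (or imputing first) is usually needed before computing summaries.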

## Convert GT data into a numeric matrix

Convert the GT data in the VCF file `test.08Jun17.d8b.vcf.gz` to a `Matrix{Union{Missing, Int8}}`. Here `as_minorallele = false` tells `VCFTools.jl` to use the `0`s and `1`s in the file as they are, so under the additive model each entry of `A` counts `ALT` alleles, without checking whether `ALT` or `REF` is the minor allele.

In [3]:
@time A = convert_gt(Int8, "test.08Jun17.d8b.vcf.gz"; as_minorallele = false, model = :additive, impute = false, center = false, scale = false)

  1.605007 seconds (6.51 M allocations: 373.169 MiB, 10.79% gc time)


191×1356 Array{Union{Missing, Int8},2}:
 0  0  0  0  1  0  0  0  0  0  0  0  2  …  0  0  0  0  1  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  2     0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  1  0  0  0  0  0  0  0  2     0  0  0  0  1  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  2     0  0  0  0  1  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  2     0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  1  0  0  0  0  0  0  0  2  …  0  0  0  0  1  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  2     0  0  0  0  0  0  0  0  0  1  1  0
 0  0  0  0  1  0  0  0  0  0  0  0  2     0  0  0  0  1  0  0  0  0  0  0  0
 0  0  0  0  1  0  0  0  0  0  0  0  2     0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  2     0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  2  …  0  0  1  0  1  0  0  0  0  1  1  0
 0  0  0  0  1  0  0  0  0  0  0  0  2     0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0 

Convert the GT data in the VCF file `test.08Jun17.d8b.vcf.gz` to a `Matrix{Union{Missing, Float64}}`. Check which of `ALT`/`REF` is the minor allele, impute the missing genotypes according to allele frequency, center the dosages around `2MAF`, and scale the dosages by dividing by `sqrt(2MAF*(1-MAF))`.

In [4]:
@time A = convert_gt(Float64, "test.08Jun17.d8b.vcf.gz"; as_minorallele = true, model = :additive, impute = true, center = true, scale = true)

  0.325079 seconds (1.54 M allocations: 127.557 MiB, 5.49% gc time)


191×1356 Array{Union{Missing, Float64},2}:
 0.0  0.0  0.0  0.0   1.41301   0.0  …  0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0  -0.586138  0.0     0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0   1.41301   0.0     0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0  -0.586138  0.0     0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0  -0.586138  0.0     0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0   1.41301   0.0  …  0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0  -0.586138  0.0     0.0  0.0   2.36899    2.36899   0.0
 0.0  0.0  0.0  0.0   1.41301   0.0     0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0   1.41301   0.0     0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0  -0.586138  0.0     0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0  -0.586138  0.0  …  0.0  0.0   2.36899    2.36899   0.0
 0.0  0.0  0.0  0.0   1.41301   0.0     0.0  0.0  -0.390016  -0.390016  0.0
 0.0  0.0  0.0  0.0  -0.586138  0.0     0.0  
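The centering and scaling step can be sketched for a single marker as follows. This is not the `VCFTools.jl` implementation: the helper `standardize_dosages` is a hypothetical name, the MAF is estimated from the observed dosages, and missing entries are filled with the mean dosage `2MAF`, which is just one simple imputation choice assumed here.

```julia
using Statistics

# Hypothetical helper: standardize one marker's minor allele counts.
# Assumes the marker is polymorphic (0 < MAF < 1); a monomorphic marker
# would divide by zero and needs separate handling.
function standardize_dosages(x::AbstractVector{<:Union{Missing, Real}})
    maf = mean(skipmissing(x)) / 2             # estimated minor allele frequency
    filled = Float64.(coalesce.(x, 2maf))      # fill missing entries with 2MAF
    return (filled .- 2maf) ./ sqrt(2maf * (1 - maf))
end

x = [0, 1, missing, 2, 0, 1]    # made-up dosages for six samples
z = standardize_dosages(x)
```

With this choice of imputation, an imputed entry ends up exactly at zero after centering, so it does not pull the marker in either direction.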

## Extract GT data marker-by-marker or window-by-window

Large VCF files can easily produce numeric arrays that do not fit into computer memory. Many analyses only need to loop over markers or sets of markers, reusing one small buffer. This can be achieved with the `copy_gt!` function.

* To loop over all markers in the VCF file test.08Jun17.d8b.vcf.gz:

In [5]:
using GeneticVariation

# initialize VCF reader
people, snps = nsamples("test.08Jun17.d8b.vcf.gz"), nrecords("test.08Jun17.d8b.vcf.gz")
reader = VCF.Reader(openvcf("test.08Jun17.d8b.vcf.gz"))
# pre-allocate vector for marker data
g = zeros(Union{Missing, Float64}, people)
for j = 1:snps
    copy_gt!(g, reader; model = :additive, impute = true, center = true, scale = true)
    # do statistical analysis
end
close(reader)

* To loop over markers in windows of size 25:

In [6]:
# initialize VCF reader
people, snps = nsamples("test.08Jun17.d8b.vcf.gz"), nrecords("test.08Jun17.d8b.vcf.gz")
reader = VCF.Reader(openvcf("test.08Jun17.d8b.vcf.gz"))
# pre-allocate matrix for marker data
windowsize = 25
g = zeros(Union{Missing, Float64}, people, windowsize)
nwindows = ceil(Int, snps / windowsize)
for j = 1:nwindows
    copy_gt!(g, reader; model = :additive, impute = true, center = true, scale = true)
    # do statistical analysis
end
close(reader)

└ @ VCFTools /Users/biona001/.julia/dev/VCFTools/src/convert.jl:67


As the warning suggests, the last window has fewer than 25 markers: the file has 1356 markers, so the 55th window holds only the final 6. The remaining columns in the matrix `g` are set to missing values.
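The window bookkeeping can be checked with a little arithmetic. The sketch below uses the marker count 1356 from the matrix dimensions shown earlier; the helper `window_range` is a hypothetical name introduced only for illustration.

```julia
snps, windowsize = 1356, 25

# number of windows; same value as ceil(Int, snps / windowsize)
nwindows = cld(snps, windowsize)

# hypothetical helper: indices of the markers that fall into window w
window_range(w, windowsize, snps) = ((w - 1) * windowsize + 1):min(w * windowsize, snps)

first_window = window_range(1, windowsize, snps)          # a full window
last_window = window_range(nwindows, windowsize, snps)    # the short final window
nfilled = length(last_window)                             # columns of g holding data
```

Inside the loop, an analysis that must ignore the padding can restrict itself to the first `length(window_range(j, windowsize, snps))` columns of `g`.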