# VarianceComponentModels.jl

[`VarianceComponentModels.jl`](https://github.com/OpenMendel/VarianceComponentModels.jl/) is a package in the [OpenMendel](https://github.com/OpenMendel) ecosystem. It implements computational routines for fitting and testing variance component models of the form

$$\text{vec}(Y) \sim \text{Normal}(XB, \Sigma_1 \otimes V_1 + \cdots + \Sigma_m \otimes V_m)$$

where $\otimes$ is the [Kronecker product](https://en.wikipedia.org/wiki/Kronecker_product). 
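
To make the covariance structure concrete, here is a minimal sketch that builds $\Sigma_1 \otimes V_1 + \Sigma_2 \otimes V_2$ for a toy problem and draws one response matrix from the model. It uses only the Julia standard library; the sizes and the values of `V1`, `Σ1`, `Σ2`, `X`, and `B` are made up for illustration and do not come from the package.

```julia
using LinearAlgebra, Random

Random.seed!(1)
n, d, p = 5, 2, 1            # individuals, traits, mean predictors (toy sizes)

# Two n×n covariance structures: a made-up relatedness matrix and the identity
A  = randn(n, n)
V1 = A * A' / n              # symmetric positive semidefinite
V2 = Matrix(1.0I, n, n)

# Two d×d variance component matrices (toy values)
Σ1 = [1.0 0.3; 0.3 2.0]
Σ2 = [0.5 0.0; 0.0 0.5]

# Covariance of vec(Y): an (n*d)×(n*d) matrix
Ω = kron(Σ1, V1) + kron(Σ2, V2)

# Draw one Y from the model, with X a column of ones and B a 1×d row
X = ones(n, p)
B = [1.0 -1.0]
L = cholesky(Symmetric(Ω)).L
Y = X * B + reshape(L * randn(n * d), n, d)

size(Ω), size(Y)
```

Because `vec(Y)` stacks the columns (traits) of `Y`, the $(i, j)$ block of `Ω` is the $n \times n$ covariance between trait $i$ and trait $j$ across individuals.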




### Package Features 
* Maximum likelihood estimation (MLE) and restricted maximum likelihood estimation (REML) of mean parameters B and variance component parameters Σ
* Allows constraints on the mean parameters B
* Choice of optimization algorithms: [Fisher scoring](https://books.google.com/books?id=QYqeYTftPNwC&lpg=PP1&pg=PA142#v=onepage&q&f=false) and [minorization-maximization algorithm](http://hua-zhou.github.io/media/pdf/ZhouHuZhouLange19VCMM.pdf)
* [Heritability Analysis](https://openmendel.github.io/VarianceComponentModels.jl/latest/man/heritability/#Heritability-Analysis-1) in genetics

### Installation

This package requires Julia v0.7.0 or later, which can be obtained from https://julialang.org/downloads/ or by building Julia from the sources in the https://github.com/JuliaLang/julia repository.

The package has not yet been registered and must be installed using the repository location. Start Julia and press the `]` key to switch to the package manager REPL:

```julia
(v1.3) pkg> add https://github.com/OpenMendel/VarianceComponentModels.jl.git#Julia-0.7
```

Use the backspace key to return to the Julia REPL.

### Outline
* [MLE and REML](#MLE-and-REML) 
* [Heritability Analysis](#Heritability-Analysis)

## Heritability Analysis 

As an application of the variance component model, this note demonstrates the workflow for heritability analysis in genetics, using a sample data set `cg10k` with **6,670** individuals and **630,860** SNPs. Person IDs and phenotype names are masked for privacy. `cg10k.bed`, `cg10k.bim`, and `cg10k.fam` are a set of Plink files in binary format. `cg10k_traits.txt` contains 13 phenotypes for the 6,670 individuals.

In [1]:
;ls cg10k.bed cg10k.bim cg10k.fam cg10k_traits.txt

ls: cg10k.bed: No such file or directory
ls: cg10k.bim: No such file or directory
ls: cg10k.fam: No such file or directory
ls: cg10k_traits.txt: No such file or directory


Machine information:

In [2]:
versioninfo()

Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


In [7]:
cd("/Users/juhyun-kim/Box Sync/workspace/vcselect/data/copd_exome")

### Read in binary SNP data 

We will use the `SnpArrays.jl` package to read in binary SNP data and compute the empirical kinship matrix. The package has not yet been registered and must be installed using the repository location. Start Julia and press the `]` key to switch to the package manager REPL:



```julia
(v0.7) pkg> add https://github.com/OpenMendel/SnpArrays.jl.git#juliav0.7
```

Use the backspace key to return to the Julia REPL.

In [4]:
using SnpArrays

┌ Info: Precompiling SnpArrays [4e780e97-f5bf-4111-9dc4-b70aaf691b06]
└ @ Base loading.jl:1273


In [8]:
# read in genotype data from Plink binary file (~50 secs on my laptop)
@time cg10k = SnpArray("copdGeneSnpFunc.bed")

  0.861086 seconds (2.23 M allocations: 115.053 MiB, 23.74% gc time)


416×108451 SnpArray:
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x02  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
    

### Summary statistics of SNP data

In [None]:
people, snps = size(cg10k)

The positions of the missing data are evaluated by

In [None]:
mp = missingpos(cg10k)

The number of missing data values in each column can be evaluated as

In [None]:
missings_by_snp = sum(mp, dims=1)

The minor allele frequency (MAF) of each SNP is computed by

In [None]:
maf_cg10k = maf(cg10k)
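
For intuition about the two summaries above, the following sketch computes per-SNP missing counts and minor allele frequencies on a tiny hand-written genotype matrix. It does not use `SnpArrays.jl`; the matrix `G` and the helper `toy_maf` are hypothetical, and the 0/1/2 allele-count coding with `missing` for uncalled genotypes is an assumption made for the example.

```julia
using Statistics

# Toy genotype matrix: rows are people, columns are SNPs. Entries count
# copies of one allele (0, 1, or 2); `missing` marks an uncalled genotype.
G = [0 1       2;
     1 missing 2;
     0 1       missing;
     2 0       2]

# Number of missing genotypes in each column (SNP)
missings_per_snp = vec(sum(ismissing.(G), dims=1))

# Allele frequency from the observed genotypes, folded to the minor allele
function toy_maf(g)
    p = mean(skipmissing(g)) / 2     # frequency of the counted allele
    return min(p, 1 - p)
end
toy_mafs = [toy_maf(col) for col in eachcol(G)]
```

Each person carries two alleles, so the mean genotype divided by 2 is the frequency of the counted allele; taking `min(p, 1 - p)` reports whichever allele is rarer.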

In [None]:
# 5 number summary and average MAF (minor allele frequencies)
using Statistics
Statistics.quantile(maf_cg10k, [0.0, 0.25, 0.5, 0.75, 1.0]), mean(maf_cg10k)

In [None]:
#using Pkg
#pkg"add Plots"
#pkg"add PyPlot"

using Plots
gr(size=(600,500), html_output_format=:png)
histogram(maf_cg10k, xlab = "Minor Allele Frequency (MAF)", label = "MAF")

In [None]:
# proportion of missing genotypes
sum(missings_by_snp) / length(cg10k)

In [None]:
# proportion of rare SNPs with maf < 0.05
count(!iszero, maf_cg10k .< 0.05) / length(maf_cg10k)

### Empirical kinship matrix

We estimate empirical kinship from the SNP data using the genetic relationship matrix (GRM). Missing genotypes are imputed on the fly by drawing according to the minor allele frequencies.

In [None]:
## GRM using SNPs with maf > 0.01 (default) (~10 mins on my laptop)
using Random 
Random.seed!(123)
@time Φgrm = grm(cg10k; method = :GRM)
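
As a rough illustration of what a GRM-based kinship estimate computes, the sketch below applies the commonly used standardized formula $\widehat\Phi = \frac{1}{2S} Z Z^T$ to simulated genotypes, where $Z$ holds the genotype columns centered by $2\hat p_k$ and scaled by $\sqrt{2\hat p_k(1-\hat p_k)}$. That `grm(...; method = :GRM)` corresponds to exactly this formula is an assumption here, and the package's MAF filtering and missing-genotype imputation are not reproduced. All names and sizes are made up.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(123)
n, s = 6, 200                          # toy sizes: people, SNPs
p = 0.1 .+ 0.4 .* rand(s)              # simulated allele frequencies in [0.1, 0.5]

# Each genotype is the sum of two independent allele draws (0, 1, or 2)
G = [Float64((rand() < p[j]) + (rand() < p[j])) for i in 1:n, j in 1:s]

# Estimate allele frequencies and drop monomorphic SNPs (zero variance)
phat = vec(mean(G, dims=1)) ./ 2
keep = findall(x -> 0 < x < 1, phat)
pk   = phat[keep]'                     # 1×k row for broadcasting over columns

# Center and scale each retained SNP column
Z = (G[:, keep] .- 2 .* pk) ./ sqrt.(2 .* pk .* (1 .- pk))

# Kinship estimate: half the average outer product of standardized genotypes
Φ = Z * Z' ./ (2 * length(keep))
```

With this scaling the diagonal of `2Φ` is close to 1 for unrelated, non-inbred individuals, which is consistent with using $2\Phi$ as the additive genetic covariance structure later in the analysis.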

### Phenotypes 

Read in the phenotype data and compute descriptive statistics.

In [None]:
#using Pkg
#pkg"add CSV DataFrames"
using CSV, DataFrames

cg10k_trait = CSV.File("cg10k_traits.txt"; 
            delim = ' ') |> DataFrame
names!(cg10k_trait, [:FID; :IID; :Trait1; :Trait2; :Trait3; :Trait4; :Trait5; :Trait6; 
             :Trait7; :Trait8; :Trait9; :Trait10; :Trait11; :Trait12; :Trait13])
# do not display FID and IID for privacy
cg10k_trait[:, 3:end]

In [None]:
Y = convert(Matrix{Float64}, cg10k_trait[:, 3:15])
histogram(Y, layout = 13)

### Pre-processing data for heritability analysis

To prepare for variance component model fitting, we form an instance of `VarianceComponentVariate`. The two variance components are $(2\Phi, I)$.

In [None]:
using VarianceComponentModels, LinearAlgebra

# form data as VarianceComponentVariate
cg10kdata = VarianceComponentVariate(Y, (2Φgrm, Matrix(1.0I, size(Y, 1), size(Y, 1))))
fieldnames(typeof(cg10kdata))

In [None]:
cg10kdata

Before fitting the variance component model, we pre-compute the eigen-decomposition of $2\Phi_{\text{GRM}}$, the rotated responses, and the constant part of the log-likelihood, and store them as a `TwoVarCompVariateRotate` instance, which is re-used in various variance component estimation procedures.

In [5]:
# pre-compute eigen-decomposition (~50 secs on my laptop)
@time cg10kdata_rotated = TwoVarCompVariateRotate(cg10kdata)
fieldnames(typeof(cg10kdata_rotated))

UndefVarError: UndefVarError: TwoVarCompVariateRotate not defined
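
To show why the pre-computed eigen-decomposition is worth storing, here is a minimal univariate sketch with no mean term. With $2\Phi = U D U^T$, the covariance $\sigma_a^2 (2\Phi) + \sigma_e^2 I$ becomes diagonal after rotating the response by $U^T$, so each log-likelihood evaluation costs $O(n)$ once the decomposition is done. This is an illustration of the idea only, not the internals of `TwoVarCompVariateRotate`; the matrix `K` and the variance values are made up.

```julia
using LinearAlgebra, Random

Random.seed!(7)
n = 8
A = randn(n, n)
K = A * A' / n                        # toy stand-in for 2Φ
σ2a, σ2e = 0.6, 0.4                   # toy variance components
y = randn(n)

# One-time eigen-decomposition K = U * Diagonal(d) * U'
F = eigen(Symmetric(K))
d, U = F.values, F.vectors
ytilde = U' * y                       # rotated response

# Log-likelihood of y ~ Normal(0, σ2a*K + σ2e*I) from the rotated pieces
w = σ2a .* d .+ σ2e                   # eigenvalues of the full covariance
loglik_rot = -0.5 * (n * log(2π) + sum(log, w) + sum(abs2.(ytilde) ./ w))

# Direct evaluation with the dense covariance, for comparison
Ω = σ2a .* K + σ2e * I
loglik_direct = -0.5 * (n * log(2π) + logdet(Ω) + dot(y, Ω \ y))
```

The two values agree up to floating-point error, but only the rotated form avoids a fresh factorization each time the variance components change during optimization.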