# Estimating Heritability and Testing SNP Association using Maximum Likelihoods of Variance Component Models

Authors: Sarah Ji, Janet Sinsheimer and Hua Zhou

We will use a variance component model to estimate the heritability of a trait and then test for association with specified markers. This example is equivalent to a replication or candidate SNP approach. Normally, if we had no prior hypothesis regarding particular loci (that is, no candidate genes), we would first test markers genome-wide using a GWAS approach that handles pedigree data appropriately. In that case, we would probably use a fast score test rather than maximum likelihood. However, maximum likelihood provides more accurate inference and parameter estimates and can be used to refine inference after screening. As an extension of the univariate case, at the end of the notebook we also demonstrate how to analyze a bivariate trait. We will also use the variance component framework with maximum likelihood tests when conducting Mendelian Randomization with families, so it is useful to understand how the Julia package VarianceComponentModels.jl works and how to load large data sets into Julia using the package SnpArrays.jl.

Machine information:

In [1]:
versioninfo()

Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)


## Data files

As an application of the variance component model, this notebook demonstrates the workflow for heritability analysis in genetics, using a sample data set, `SNP_29C`, of **212** individuals and **253,141** SNPs from the Mendel version 16.0 sample input files. 

`SNP_29a.bed`, `SNP_29a.bim`, and `SNP_29C.fam` are the set of Plink files in binary format used in this notebook. The data files and software packages associated with Mendel can be downloaded for free from the UCLA genetics page: http://www.genetics.ucla.edu/software/mendel

For more information on Mendel version 16.0 see: Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570. http://www.genetics.ucla.edu/software/Mendel_current_doc.pdf

>SNP29C Dataset was simulated with a realistic linkage disequilibrium (LD) structure and constructed from phased sequence data from chromosome 19 on 85 individuals of northern and western European ancestry. After removing mono-allelic markers in this set of individuals, 253,141 SNPs remained. Almost half of the SNPs have minor allele frequencies (MAF) below 5%. The haplotype pairs attributed to the 85 CEPH members were reassigned to the 85 founders of 27 pedigree structures selected from the Framingham Heart Study (FHS, http://www.framinghamheartstudy.org). The selected Framingham pedigrees were chosen to reflect the kind of pedigrees commonly collected in family-based genetic studies. The 27 pedigrees encompass 212 people, range in size from 1 to 36 people and from 1 to 5 generations, and contain sibships of 1 to 5 children. The genotypes of non-founders were simulated, using Option 17, conditional on the haplotypes imposed on the founders. All genotypes were recorded as unordered for subsequent analyses.

Genome-wide QTL and eQTL analyses using Mendel, Hua Zhou, Jin Zhou, Tao Hu, Eric M. Sobel, Kenneth Lange https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5133530/


## Read in the data

Take a look at the first 10 lines of the pedigree file, SNP_29C.fam. The columns are comma separated. This file is in the classic Mendel format, Family Id, Person ID, Father ID, Mother Id, sex as F (female) or M (male), monozygotic twin indicator, Trait1 and Trait2.  The traits were simulated using Option 28 of the Mendel Software Program based on the major locus rs10412915. Here we simulated two correlated quantitative traits, Trait1 and Trait2.

Trait1 was simulated with a grand mean $\mu_1$ = 40, sex effect $\beta_{sex, 1}$ = 6, major locus effect $\beta_{snp, 1}$ = -1.5, additive variance $\sigma^2_{a1}$ = 4, and environmental variance $\sigma^2_{e1}$ = 2. 
Trait2 was simulated with a grand mean $\mu_2$ = 20, sex effect $\beta_{sex, 2}$ = 4, major locus effect $\beta_{snp, 2}$ = -1.5, additive variance $\sigma^2_{a2}$ = 4, and environmental variance $\sigma^2_{e2}$ = 2. 
The covariances between the traits are $\sigma_{a1, a2}$ = 1 and $\sigma_{e1, e2}$ = 0.

In [2]:
;head SNP_29C.fam

  1       ,  16      ,          ,          ,  F       ,          ,  30.20564,   9.24210,
  1       ,  8228    ,          ,          ,  F       ,          ,  35.82143,  15.27458,
  1       ,  17008   ,          ,          ,  M       ,          ,  36.05298,  19.50496,
  1       ,  9218    ,  17008   ,  16      ,  M       ,          ,  38.96351,  18.98575,
  1       ,  3226    ,  9218    ,  8228    ,  F       ,          ,  33.73911,  21.10412,
  2       ,  29      ,          ,          ,  F       ,          ,  34.88835,  19.01142,
  2       ,  2294    ,          ,          ,  M       ,          ,  37.70105,  19.16556,
  2       ,  3416    ,          ,          ,  M       ,          ,  45.13171,  19.84088,
  2       ,  17893   ,  2294    ,  29      ,  F       ,          ,  35.15599,  14.14228,
  2       ,  6952    ,  3416    ,  17893   ,  M       ,          ,  42.45136,  19.92713,


Read in the pedigree file into an array.

In [3]:
# columns are: :famid, :id, :faid, :moid, :sex, :twin, :Trait1, :Trait2
# (the trailing comma on each line produces an empty 9th column)
pedLMM = readcsv("SNP_29C.fam", Any; header = false)

212×9 Array{Any,2}:
     1     16       "          "  …  "          "  30.2056   9.2421  ""
     1   8228       "          "     "          "  35.8214  15.2746  ""
     1  17008       "          "     "          "  36.053   19.505   ""
     1   9218  17008                 "          "  38.9635  18.9857  ""
     1   3226   9218                 "          "  33.7391  21.1041  ""
     2     29       "          "  …  "          "  34.8884  19.0114  ""
     2   2294       "          "     "          "  37.7011  19.1656  ""
     2   3416       "          "     "          "  45.1317  19.8409  ""
     2  17893   2294                 "          "  35.156   14.1423  ""
     2   6952   3416                 "          "  42.4514  19.9271  ""
     2  14695   2294              …  "          "  35.6426  17.4191  ""
     2   6790   2294                 "          "  40.6344  23.6845  ""
     2   3916   2294                 "          "  34.8618  16.8684  ""
     ⋮                            ⋱  ⋮      
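
To see why the array has 9 columns and which column holds which field, here is a minimal sketch that parses a single record by hand. The record is copied from the `head` output above; the variable names are only illustrative, and the sketch uses plain string functions rather than the notebook's `readcsv` call.

```julia
# One record copied from the `head SNP_29C.fam` output.
line = "  1       ,  9218    ,  17008   ,  16      ,  M       ,          ,  38.96351,  18.98575,"

# Split on commas and trim the padding around each field.
# The trailing comma yields an empty 9th field.
fields = map(strip, split(line, ","))

famid, id, faid, moid = fields[1], fields[2], fields[3], fields[4]
sex_code = fields[5]                      # "F" or "M"
trait1 = parse(Float64, fields[7])
trait2 = parse(Float64, fields[8])

println(length(fields))                   # number of fields, including the empty last one
println((faid, moid, sex_code, trait1, trait2))
```

In this record the third field (17008) is a male and the fourth field (16) is a female elsewhere in the file, which is consistent with the father ID preceding the mother ID.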

We don't need to retain the ids so we retrieve the two phenotypes and put them in an array Y.

In [4]:
Trait1 = convert(Vector{Float64}, pedLMM[:, 7])
Trait2 = convert(Vector{Float64}, pedLMM[:, 8])
Y = [Trait1 Trait2]

212×2 Array{Float64,2}:
 30.2056   9.2421
 35.8214  15.2746
 36.053   19.505 
 38.9635  18.9857
 33.7391  21.1041
 34.8884  19.0114
 37.7011  19.1656
 45.1317  19.8409
 35.156   14.1423
 42.4514  19.9271
 35.6426  17.4191
 40.6344  23.6845
 34.8618  16.8684
  ⋮              
 40.0522  21.5122
 39.3161  24.8508
 41.7913  22.5294
 36.3301  17.0813
 42.9442  17.1984
 39.8927  20.9043
 42.5795  15.9365
 47.8619  19.8943
 41.0531  25.1045
 39.9502  19.7227
 35.4778  21.935 
 44.3932  26.1222

We retrieve sex data coded as 0 (male) or 1 (female), which means male is the reference group.  You can change the code to `sex = map(x -> strip(x) == "M" ? 1.0 : 0.0, pedLMM[:, 5])` if you want female to be the reference group. 

In [5]:
sex = map(x -> strip(x) == "F" ? 1.0 : 0.0,  pedLMM[:, 5])

212-element Array{Float64,1}:
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 0.0
 1.0
 0.0
 1.0
 ⋮  
 0.0
 0.0
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

Using a Unix command, take a look at the first 10 lines of the SNP definition file before we read it into an array.

In [7]:
;head SNP_29a.bim

19	rs3020701       	0	90974	1	2
19	rs56343121      	0	91106	1	2
19	rs143501051     	0	93542	1	2
19	rs56182540      	0	95981	1	2
19	rs7260412       	0	105021	1	2
19	rs11669393      	0	107866	1	2
19	rs181646587     	0	107894	1	2
19	rs8106297       	0	107958	1	2
19	rs8106302       	0	107962	1	2
19	rs183568620     	0	107987	1	2


Read in the SNP definition file into a Julia array.

In [8]:
# columns are: :chrom, :snpid, :genetic_distance (cM), :pos, :allele1, :allele2
snpLMM = readdlm("SNP_29a.bim"; header = false)

253141×6 Array{Any,2}:
 19  "rs3020701"    0     90974  1  2
 19  "rs56343121"   0     91106  1  2
 19  "rs143501051"  0     93542  1  2
 19  "rs56182540"   0     95981  1  2
 19  "rs7260412"    0    105021  1  2
 19  "rs11669393"   0    107866  1  2
 19  "rs181646587"  0    107894  1  2
 19  "rs8106297"    0    107958  1  2
 19  "rs8106302"    0    107962  1  2
 19  "rs183568620"  0    107987  1  2
 19  "rs186451972"  0    108003  1  2
 19  "rs189699222"  0    108032  1  2
 19  "rs182902214"  0    108090  1  2
  ⋮                                 ⋮
 19  "rs188169422"  0  59116080  1  2
 19  "rs144587467"  0  59117729  1  2
 19  "rs139879509"  0  59117949  1  2
 19  "rs143250448"  0  59117982  1  2
 19  "rs145384750"  0  59118028  1  2
 19  "rs149215836"  0  59118040  1  2
 19  "rs139221927"  0  59118044  1  2
 19  "rs181848453"  0  59118114  1  2
 19  "rs138318162"  0  59118148  1  2
 19  "rs186913222"  0  59118616  1  2
 19  "rs141816674"  0  59118779  1  2
 19  "rs150801216"  0  5911

We don't need the relative position of the snps in this case so we just retrieve SNP IDs.

In [9]:
snpid = map(x -> strip(string(x)), snpLMM[:, 2])

253141-element Array{AbstractString,1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

Read in the SNP binary file using the SnpArrays.jl package.

In [10]:
using SnpArrays
snpbinLMM = SnpArray("SNP_29a")

[1m[36mINFO: [39m[22m[36mv1.0 BED file detected
[39m

212×253141 SnpArrays.SnpArray{2}:
 (true, true)  (true, true)   …  (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)   …  (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (true, true)      (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (false, true)  …  (false, false)  (true, true)  
 (true, true)  (true, true)      (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 ⋮                            ⋱                  ⋮             
 (true, true)  (true, true)   …  (true, true)    (false, false)
 (true

### Filtering the variant data to improve the quality of the GRM

First we get an idea of the minor allele frequencies. Checking the quantiles shows that many of the loci are quite rare: a quarter of the SNPs have a MAF below about 0.012. By default the GRM function uses only variants with minor allele frequencies greater than 0.01, but we impose additional restrictions, MAF ≥ 0.05 and a genotyping success rate > 98%, to avoid potential biases.

In [11]:
maf, minor_allele, missings_by_snp, missings_by_person = summarize(snpbinLMM)

([0.0165094, 0.0825472, 0.00943396, 0.0872642, 0.0849057, 0.0259434, 0.0141509, 0.0589623, 0.0589623, 0.0188679  …  0.313679, 0.28066, 0.287736, 0.370283, 0.306604, 0.0707547, 0.0448113, 0.216981, 0.28066, 0.278302], Bool[true, true, false, true, true, true, false, true, true, false  …  true, true, true, false, false, true, false, false, true, false], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [12]:
quantile(maf, [0.0 .25 .5 .75 1.0])

1×5 Array{Float64,2}:
 0.00235849  0.0117925  0.0683962  0.228774  0.5

In [13]:
# first we filter out SNPs with genotype success rates < 98%; snp_idx is a Boolean mask marking the SNPs that pass
snp_idx, _ = filter(snpbinLMM, 0.98)

(Bool[true, true, true, true, true, true, true, true, true, true  …  true, true, true, true, true, true, true, true, true, true], Bool[true, true, true, true, true, true, true, true, true, true  …  true, true, true, true, true, true, true, true, true, true])

In [14]:
#now we find the index of the common snps (MAF greater than or equal to 0.05) with success rates >0.98
common_index = snp_idx .& (0.05 .≤ maf);
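
The elementwise logic of this mask can be checked on a toy example. This is only an illustrative sketch: `maf_toy` and `ok_toy` are made-up inputs (the first three frequencies are rounded from the `summarize` output above), not results of the actual filter.

```julia
maf_toy = [0.0165, 0.0825, 0.0094, 0.0500]   # toy minor allele frequencies
ok_toy  = [true, true, true, false]          # hypothetical success-rate mask

# A SNP is kept only if it passes the success-rate filter AND has MAF ≥ 0.05.
keep = ok_toy .& (0.05 .≤ maf_toy)

println(keep)
println(sum(keep))                           # number of SNPs retained
```

Only the second SNP is kept here: the first and third are too rare, and the fourth meets the MAF cutoff but fails the hypothetical success-rate filter.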

In [15]:
# now we put these snps into an array for use with the GRM function. 
data_common = snpbinLMM[ : , common_index]

212×137741 SnpArrays.SnpArray{2}:
 (true, true)   (true, true)   …  (true, true)    (false, false)
 (true, true)   (false, true)     (true, true)    (false, false)
 (true, true)   (true, true)      (true, true)    (false, false)
 (true, true)   (true, true)      (true, true)    (false, false)
 (true, true)   (true, true)      (true, true)    (false, false)
 (true, true)   (true, true)   …  (false, false)  (true, true)  
 (false, true)  (true, true)      (false, false)  (true, true)  
 (true, true)   (true, true)      (false, false)  (true, true)  
 (false, true)  (true, true)      (false, false)  (true, true)  
 (false, true)  (true, true)      (false, false)  (true, true)  
 (false, true)  (true, true)   …  (false, false)  (true, true)  
 (true, true)   (true, true)      (false, false)  (true, true)  
 (false, true)  (true, true)      (false, false)  (true, true)  
 ⋮                             ⋱                  ⋮             
 (true, true)   (true, true)   …  (true, true)    (false

## Kinship via Genetic Relationship Matrix (GRM)

Recall that in using variance components (linear mixed models) we need a measure of the relatedness among individuals. In this example we use the GRM, so that the estimate of the global kinship coefficient of individuals $i$ and $j$ is,
$$ \widehat\Phi_{GRMij} = \frac{1}{2S} \sum_{k=1}^S \frac{(x_{ik} -2p_k)(x_{jk} - 2p_k)}{2 p_k (1-p_k)}$$
where $k$ ranges over the selected $S$ SNPs, $p_k$ is the minor allele frequency of SNP $k$, and $x_{ik}$ is the number of minor alleles in individual $i$'s genotype at SNP $k$.
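
The formula can be written out directly as a short function. This is a minimal sketch under stated assumptions, not the SnpArrays.jl implementation: `grm_sketch` is a hypothetical helper that takes a dense matrix of allele counts with no missing genotypes, estimates each $p_k$ from the sample, and requires every SNP to be polymorphic so the denominator is nonzero.

```julia
# Hypothetical helper: GRM from an n × S matrix `x` of allele counts (0, 1 or 2).
function grm_sketch(x)
    n, S = size(x)
    Φ = zeros(n, n)
    for k in 1:S
        xk = x[:, k]
        p = sum(xk) / (2n)                       # sample allele frequency p_k
        w = (xk .- 2p) ./ sqrt(2p * (1 - p))     # centered and scaled column
        Φ .+= w * w'                             # adds w_i * w_j for every pair (i, j)
    end
    return Φ ./ (2S)
end

# Toy data: 4 individuals, 3 polymorphic SNPs.
x = [0.0 1.0 2.0;
     1.0 1.0 0.0;
     2.0 0.0 1.0;
     1.0 2.0 1.0]
Φtoy = grm_sketch(x)
println(Φtoy)
```

Because each column is centered at its sample mean, every row of the toy matrix sums to zero, which is a quick sanity check on the centering step.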

## Calculate the GRM matrix

As mentioned above, by default, `grm` excludes SNPs with maf < 0.01 but we will use only the common snps (>0.05) with good success rates (>0.98). 

In [16]:
Φgrm = grm(data_common)

212×212 Array{Float64,2}:
  0.498264     0.0080878    0.0164327   …   0.0246825    0.00181856
  0.0080878    0.498054    -0.0212599      -0.0285927   -0.0226525 
  0.0164327   -0.0212599    0.499442       -0.0219661   -0.00748536
  0.253627    -0.00160532   0.282542        0.00612693  -0.00339125
  0.126098     0.253365     0.128931       -0.0158446   -0.00633959
 -0.014971    -0.00266073  -0.00243384  …   0.00384757   0.0145936 
 -0.0221357    0.0100492   -0.0107012      -0.0148443   -0.00127783
 -0.01629     -0.00749253  -0.015372       -0.0163305   -0.00258392
 -0.016679     0.00353587  -0.0128844      -0.0332489   -0.00707839
 -0.0176101   -0.00996912  -0.0158473      -0.00675875  -0.0122339 
 -0.0162558    0.00938592   0.0064231   …  -0.00510882   0.0168778 
 -0.0167487    0.00414544  -0.00936538     -0.0134863    0.0020952 
 -0.031148     0.00112387  -0.010794        0.00383105   0.0198635 
  ⋮                                     ⋱   ⋮                      
 -0.00865735  -0.00335

## Fit the null variance component model

Recall that we are using a variance component model with Trait1 as the outcome. Under the null hypothesis, Trait1 is associated with sex (as a fixed effect) but not with any SNP.  We also need to account for the relatedness among individuals.  To do that we include a random effect and use the GRM to describe the covariance structure. 
    $$ Y_{1i} = \mu +\beta_{sex} sex_i + A_i + e_i$$ 
    $$ A_i \sim N(0,\sigma^2_a)$$ $$e_i \sim N(0,\sigma^2_e)$$
    $$ Cov(Y_{1i},Y_{1j})=2\Phi_{ij} \sigma^2_a + 1_{i = j}\sigma^2_e$$
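
In matrix form, the model above says that the vector $y$ of trait values for the $n$ individuals is multivariate normal. Writing $X$ for the design matrix (a column of ones and the sex indicator), $\beta = (\mu, \beta_{sex})^T$, and $\Phi$ for the GRM, the log-likelihood that maximum likelihood estimation works with is the standard Gaussian one (the symbols $y$, $X$, and $V$ are introduced here for exposition):

$$ y \sim N(X\beta, V), \qquad V = 2\sigma^2_a \Phi + \sigma^2_e I_n $$
$$ \ell(\beta, \sigma^2_a, \sigma^2_e) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\det V - \frac{1}{2}(y - X\beta)^T V^{-1} (y - X\beta) $$

The two matrices $2\Phi$ and $I_n$ are exactly the pair of covariance matrices passed to `VarianceComponentVariate` in the next cell.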

In [17]:
using VarianceComponentModels
# Null data model has two variance components but no SNP fixed effects
# form data as VarianceComponentVariate matrix 
# change the next two commands if you want to run trait 2 or both traits (Y)
X = [ones(length(Trait1)) sex]
nulldata = VarianceComponentVariate(Y[:,1], X, (2Φgrm, eye(length(Trait1))))

VarianceComponentModels.VarianceComponentVariate{Float64,2,Array{Float64,1},Array{Float64,2},Array{Float64,2}}([30.2056, 35.8214, 36.053, 38.9635, 33.7391, 34.8884, 37.7011, 45.1317, 35.156, 42.4514  …  41.7913, 36.3301, 42.9442, 39.8927, 42.5795, 47.8619, 41.0531, 39.9502, 35.4778, 44.3932], [1.0 1.0; 1.0 1.0; … ; 1.0 0.0; 1.0 0.0], ([0.996528 0.0161756 … 0.049365 0.00363711; 0.0161756 0.996108 … -0.0571855 -0.0453049; … ; 0.049365 -0.0571855 … 1.188 0.0994167; 0.00363711 -0.0453049 … 0.0994167 0.983485], [1.0 0.0 … 0.0 0.0; 0.0 1.0 … 0.0 0.0; … ; 0.0 0.0 … 1.0 0.0; 0.0 0.0 … 0.0 1.0]))

In [18]:
nullmodel = VarianceComponentModel(nulldata)

VarianceComponentModels.VarianceComponentModel{Float64,2,Array{Float64,2},Array{Float64,2}}([0.0; 0.0], ([1.0], [1.0]), Array{Float64}(0,2), Char[], Float64[], -Inf, Inf)

In [19]:
@time nulllogl, nullmodel, = fit_mle!(nullmodel, nulldata; algo = :FS)


******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

This is Ipopt version 3.12.8, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:        0
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:        3

Total number of variables............................:        2
                     variables with only lower bounds:        0
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equa

(-475.21651097110305, VarianceComponentModels.VarianceComponentModel{Float64,2,Array{Float64,2},Array{Float64,2}}([40.8917; -6.6255], ([4.06929], [2.15102]), Array{Float64}(0,2), Char[], Float64[], -Inf, Inf), ([0.963035], [0.558179]), [0.927437 -0.371979; -0.371979 0.311564], [0.173131; 0.307754], [0.0299744 -0.0433355; -0.0433355 0.0947127])

In [20]:
# null model log-likelihood for no SNP effects
nulllogl

-475.21651097110305

In [21]:
# null model mean effects - in this case a grand mean and a sex effect
nullmodel.B

2×1 Array{Float64,2}:
 40.8917
 -6.6255

In [22]:
# null model additive genetic variance
nullmodel.Σ[1]

1×1 Array{Float64,2}:
 4.06929

In [23]:
# null model environmental variance
nullmodel.Σ[2]

1×1 Array{Float64,2}:
 2.15102

### Heritability 
Calculate the proportion of the variance that can be attributed to additive genetic effects, the narrow sense heritability.  We calculate it here without any SNPs included. 

In [24]:
her_null = nullmodel.Σ[1]/(nullmodel.Σ[1] + nullmodel.Σ[2])

1×1 Array{Float64,2}:
 0.654194
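
For comparison, the simulated variances (4 and 2) give a heritability of 4/(4 + 2) ≈ 0.667. An approximate standard error for the estimate can be obtained with the delta method. The sketch below is an illustration under an assumption: it treats the 2×2 matrix in the `fit_mle!` output above as the covariance matrix of the two variance component estimates, which is consistent with its diagonal matching the squares of the reported standard errors (0.963035 and 0.558179). The numbers are copied from that output; the variable names are hypothetical.

```julia
# Estimates copied from the In [19] output.
σ2a, σ2e = 4.06929, 2.15102              # additive and environmental variances
Σcov = [0.927437 -0.371979;              # assumed covariance matrix of (σ2a, σ2e) estimates
        -0.371979 0.311564]

total = σ2a + σ2e
h2 = σ2a / total

# Gradient of h2 = σ2a / (σ2a + σ2e) with respect to (σ2a, σ2e).
g = [σ2e / total^2, -σ2a / total^2]

# First-order (delta method) approximation to the standard error of h2.
h2_se = sqrt(g' * Σcov * g)
println((h2, h2_se))
```

With 212 individuals the resulting standard error is large relative to the estimate, so the gap between the estimate and the simulated value is not surprising.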

# Fit the variance component model with SNPs as fixed effects

## Processing the SNP data
These data were simulated under a scenario in which one SNP has a large main effect. First we find the index of that SNP, rs10412915, and, for the sake of demonstrating how to test for interactions, we also find the index of SNP rs1036231. 

In [25]:
ind_rs10412915 = find(x -> x == "rs10412915", snpid)[1]
ind_rs1036231 = find(x -> x == "rs1036231", snpid)[1]

236108

Now we convert the SNP data into 0, 1, or 2 copies of the minor allele and form the data for the interaction of the two SNPs. 

In [26]:
snp_rs10412915 = convert(Vector{Float64}, snpbinLMM[:, ind_rs10412915])
snp_rs1036231 = convert(Vector{Float64}, snpbinLMM[:, ind_rs1036231])
interaction = snp_rs10412915 .* snp_rs1036231

212-element Array{Float64,1}:
 0.0
 4.0
 0.0
 0.0
 1.0
 1.0
 1.0
 0.0
 1.0
 1.0
 4.0
 4.0
 4.0
 ⋮  
 0.0
 1.0
 0.0
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 4.0

## Look at the effect of a single SNP snp_rs10412915

We first test whether snp_rs10412915 has a significant effect. We form the design matrix Xalt with an intercept, sex, and the SNP of interest, snp_rs10412915. Then we form the variance component model using the VarianceComponentModels.jl package. 

In [27]:
# form data as VarianceComponentVariate - put the data in a form that VarianceComponentModels can use
Xalt = [ones(length(Trait1)) sex snp_rs10412915]
altdata = VarianceComponentVariate(Y[:, 1], Xalt, (2Φgrm, eye(length(Trait1))))

VarianceComponentModels.VarianceComponentVariate{Float64,2,Array{Float64,1},Array{Float64,2},Array{Float64,2}}([30.2056, 35.8214, 36.053, 38.9635, 33.7391, 34.8884, 37.7011, 45.1317, 35.156, 42.4514  …  41.7913, 36.3301, 42.9442, 39.8927, 42.5795, 47.8619, 41.0531, 39.9502, 35.4778, 44.3932], [1.0 1.0 0.0; 1.0 1.0 2.0; … ; 1.0 0.0 0.0; 1.0 0.0 2.0], ([0.996528 0.0161756 … 0.049365 0.00363711; 0.0161756 0.996108 … -0.0571855 -0.0453049; … ; 0.049365 -0.0571855 … 1.188 0.0994167; 0.00363711 -0.0453049 … 0.0994167 0.983485], [1.0 0.0 … 0.0 0.0; 0.0 1.0 … 0.0 0.0; … ; 0.0 0.0 … 1.0 0.0; 0.0 0.0 … 0.0 1.0]))

In [28]:
altmodel = VarianceComponentModel(altdata)

VarianceComponentModels.VarianceComponentModel{Float64,2,Array{Float64,2},Array{Float64,2}}([0.0; 0.0; 0.0], ([1.0], [1.0]), Array{Float64}(0,3), Char[], Float64[], -Inf, Inf)

### Set the starting values for the maximum likelihood estimation
Use the null model estimates as start values for the alternative model.

In [29]:
altmodel.B[1:2, :] = nullmodel.B
altmodel.B

3×1 Array{Float64,2}:
 40.8917
 -6.6255
  0.0   

In [30]:
copy!(altmodel.Σ[1], nullmodel.Σ[1])
copy!(altmodel.Σ[2], nullmodel.Σ[2])
altmodel.Σ

([4.06929], [2.15102])

In [31]:
@time altlogl1, altmodel, = fit_mle!(altmodel, altdata; algo = :FS)

This is Ipopt version 3.12.8, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:        0
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:        3

Total number of variables............................:        2
                     variables with only lower bounds:        0
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
   0  

(-466.8763652266441, VarianceComponentModels.VarianceComponentModel{Float64,2,Array{Float64,2},Array{Float64,2}}([40.1067; -6.50856; 1.25051], ([3.37415], [2.2211]), Array{Float64}(0,3), Char[], Float64[], -Inf, Inf), ([0.860443], [0.533578]), [0.740362 -0.317562; -0.317562 0.284705], [0.251915; 0.297616; 0.298098], [0.0634613 -0.0449424 -0.0554297; -0.0449424 0.0885754 0.00754819; -0.0554297 0.00754819 0.0888623])

In [32]:
# alt model log-likelihood for the single SNP, snp_rs10412915
altlogl1

-466.8763652266441

In [33]:
# alt model mean effects
altmodel.B

3×1 Array{Float64,2}:
 40.1067 
 -6.50856
  1.25051

In [34]:
# alt model additive genetic variance
altmodel.Σ[1]

1×1 Array{Float64,2}:
 3.37415

In [35]:
# alt model environmental variance
altmodel.Σ[2]

1×1 Array{Float64,2}:
 2.2211

Notice that the additive genetic variance has decreased (from 4.07 to 3.37), while the environmental variance has increased slightly (from 2.15 to 2.22).

To test the significance of the SNP, we use a likelihood ratio test (LRT).
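
Written out, the test compares the maximized log-likelihoods of the alternative model (with the SNP) and the null model (without it). The symbol $\Lambda$ is introduced here for exposition. Because the alternative model adds one fixed effect for a univariate trait, the statistic is referred to a chi-square distribution with 1 degree of freedom:

$$ \Lambda = 2\left[\ell_{alt} - \ell_{null}\right] = 2\left[-466.876 - (-475.217)\right] \approx 16.68, \qquad p = P(\chi^2_1 > \Lambda) $$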

In [36]:
using Distributions
LRT1 = 2(altlogl1 - nulllogl)

16.680291488917874

In [37]:
#change the degrees of freedom if running a bivariate outcome
pval_snp_rs10412915 = ccdf(Chisq(1), LRT1)

4.423820951296014e-5

Although snp_rs10412915 has a small p-value, the result would not be genome-wide significant if this SNP had been found in a genome-wide screen.
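
To make the comparison concrete, the sketch below checks the p-value against two thresholds that could be used in a genome-wide screen: a Bonferroni correction over all 253,141 SNPs in this data set, and the widely used conventional threshold of 5e-8. Both thresholds are assumptions made for illustration; the notebook itself does not specify which one it has in mind.

```julia
n_snps = 253141                          # SNPs in the data set
pval = 4.423820951296014e-5              # LRT p-value from In [37]

bonferroni = 0.05 / n_snps               # family-wise 5% level across all SNPs
conventional = 5e-8                      # widely used genome-wide threshold

println(bonferroni)
println(pval < bonferroni)               # does the SNP pass the Bonferroni threshold?
println(pval < conventional)             # does it pass the conventional threshold?
```

In a true candidate SNP or replication setting, where only this one test is planned, the usual single-test level of 0.05 would apply instead.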

## Check for an interaction.  
### First calculate the log likelihood for additive effects of the two snps snp_rs10412915 and snp_rs1036231 without the interaction

Similar to the single SNP case, we first form the design matrix Xalt2 with an intercept, sex, snp_rs10412915 and snp_rs1036231. Then we form the variance component model using the VarianceComponentModels.jl package. 

In [38]:
# form data as VarianceComponentVariate
Xalt2 = [ones(length(Trait1)) sex snp_rs10412915 snp_rs1036231]
altdata2 = VarianceComponentVariate(Y[:,1], Xalt2, (2Φgrm, eye(length(Trait1))))

VarianceComponentModels.VarianceComponentVariate{Float64,2,Array{Float64,1},Array{Float64,2},Array{Float64,2}}([30.2056, 35.8214, 36.053, 38.9635, 33.7391, 34.8884, 37.7011, 45.1317, 35.156, 42.4514  …  41.7913, 36.3301, 42.9442, 39.8927, 42.5795, 47.8619, 41.0531, 39.9502, 35.4778, 44.3932], [1.0 1.0 0.0 0.0; 1.0 1.0 2.0 2.0; … ; 1.0 0.0 0.0 0.0; 1.0 0.0 2.0 2.0], ([0.996528 0.0161756 … 0.049365 0.00363711; 0.0161756 0.996108 … -0.0571855 -0.0453049; … ; 0.049365 -0.0571855 … 1.188 0.0994167; 0.00363711 -0.0453049 … 0.0994167 0.983485], [1.0 0.0 … 0.0 0.0; 0.0 1.0 … 0.0 0.0; … ; 0.0 0.0 … 1.0 0.0; 0.0 0.0 … 0.0 1.0]))

In [39]:
altmodel2 = VarianceComponentModel(altdata2)

VarianceComponentModels.VarianceComponentModel{Float64,2,Array{Float64,2},Array{Float64,2}}([0.0; 0.0; 0.0; 0.0], ([1.0], [1.0]), Array{Float64}(0,4), Char[], Float64[], -Inf, Inf)

In [40]:
altmodel2.B[1:2, :] = nullmodel.B
altmodel2.B

4×1 Array{Float64,2}:
 40.8917
 -6.6255
  0.0   
  0.0   

In [41]:
copy!(altmodel2.Σ[1], nullmodel.Σ[1])
copy!(altmodel2.Σ[2], nullmodel.Σ[2])
altmodel2.Σ

([4.06929], [2.15102])

In [42]:
@time altlogl2, altmodel2, = fit_mle!(altmodel2, altdata2; algo = :FS)

This is Ipopt version 3.12.8, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:        0
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:        3

Total number of variables............................:        2
                     variables with only lower bounds:        0
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
   0  

(-466.8547142107033, VarianceComponentModels.VarianceComponentModel{Float64,2,Array{Float64,2},Array{Float64,2}}([40.1013; -6.5027; 1.05363; 0.203213], ([3.35452], [2.23228]), Array{Float64}(0,4), Char[], Float64[], -Inf, Inf), ([0.858738], [0.534305]), [0.737431 -0.317352; -0.317352 0.285482], [0.253095; 0.298687; 0.98677; 0.970398], [0.0640571 -0.0455503 -0.0324548 -0.0236183; -0.0455503 0.0892141 -0.0162411 0.0245267; -0.0324548 -0.0162411 0.973715 -0.912885; -0.0236183 0.0245267 -0.912885 0.941673])

In [43]:
altlogl2

-466.8547142107033

In [44]:
# alt model mean effects
altmodel2.B

4×1 Array{Float64,2}:
 40.1013  
 -6.5027  
  1.05363 
  0.203213

In [45]:
# alt model additive variance
altmodel2.Σ[1]

1×1 Array{Float64,2}:
 3.35452

In [46]:
# alt model environmental variance
altmodel2.Σ[2]

1×1 Array{Float64,2}:
 2.23228

### Test whether the addition of the second SNP improves the model fit by comparing the loglikelihood with just snp snp_rs10412915 to the loglikelihood with both snp_rs10412915 and snp_rs1036231

In [47]:
using Distributions
LRT2 = 2(altlogl2 - altlogl1)

0.04330203188158066

In [48]:
#change the degrees of freedom if running a bivariate outcome
pval_two_snps = ccdf(Chisq(1), LRT2)

0.8351575998764647

We see that adding snp_rs1036231 as an additional covariate to the single snp model with snp_rs10412915 does not explain more of the variation in Trait1.

## Check for evidence of an interaction between the two SNPs

Suppose you now want to test for an interaction effect between snp_rs10412915 and snp_rs1036231. We first form the design matrix Xalt3 with an intercept, sex, snp_rs10412915, snp_rs1036231 and their interaction. Then we create the variance component model using the VarianceComponentModels.jl package. 

In [49]:
# form data as VarianceComponentVariate
Xalt3 = [ones(length(Trait1)) sex snp_rs10412915 snp_rs1036231 interaction]
altdata3 = VarianceComponentVariate(Trait1, Xalt3, (2Φgrm, eye(length(Trait1))))

VarianceComponentModels.VarianceComponentVariate{Float64,2,Array{Float64,1},Array{Float64,2},Array{Float64,2}}([30.2056, 35.8214, 36.053, 38.9635, 33.7391, 34.8884, 37.7011, 45.1317, 35.156, 42.4514  …  41.7913, 36.3301, 42.9442, 39.8927, 42.5795, 47.8619, 41.0531, 39.9502, 35.4778, 44.3932], [1.0 1.0 … 0.0 0.0; 1.0 1.0 … 2.0 4.0; … ; 1.0 0.0 … 0.0 0.0; 1.0 0.0 … 2.0 4.0], ([0.996528 0.0161756 … 0.049365 0.00363711; 0.0161756 0.996108 … -0.0571855 -0.0453049; … ; 0.049365 -0.0571855 … 1.188 0.0994167; 0.00363711 -0.0453049 … 0.0994167 0.983485], [1.0 0.0 … 0.0 0.0; 0.0 1.0 … 0.0 0.0; … ; 0.0 0.0 … 1.0 0.0; 0.0 0.0 … 0.0 1.0]))

Use the results of the two-SNP additive model as the starting point for the interaction model.

In [50]:
altmodel3 = VarianceComponentModel(altdata3)

VarianceComponentModels.VarianceComponentModel{Float64,2,Array{Float64,2},Array{Float64,2}}([0.0; 0.0; … ; 0.0; 0.0], ([1.0], [1.0]), Array{Float64}(0,5), Char[], Float64[], -Inf, Inf)

In [51]:
altmodel3.B[1:4, :] = altmodel2.B
altmodel3.B

5×1 Array{Float64,2}:
 40.1013  
 -6.5027  
  1.05363 
  0.203213
  0.0     

In [52]:
copy!(altmodel3.Σ[1], altmodel2.Σ[1])
copy!(altmodel3.Σ[2], altmodel2.Σ[2])
altmodel3.Σ

([3.35452], [2.23228])

In [53]:
@time altlogl3, altmodel3, = fit_mle!(altmodel3, altdata3; algo = :FS)

This is Ipopt version 3.12.8, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:        0
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:        3

Total number of variables............................:        2
                     variables with only lower bounds:        0
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
   0  

(-466.79524084075194, VarianceComponentModels.VarianceComponentModel{Float64,2,Array{Float64,2},Array{Float64,2}}([40.0792; -6.50488; … ; 0.311402; -0.121256], ([3.37194], [2.21917]), Array{Float64}(0,5), Char[], Float64[], -Inf, Inf), ([0.859815], [0.533153]), [0.739281 -0.317078; -0.317078 0.284252], [0.261231; 0.298608; … ; 1.02201; 0.350703], [0.0682415 -0.0452335 … -0.044493 0.022822; -0.0452335 0.0891668 … 0.0231133 0.00156835; … ; -0.044493 0.0231133 … 1.0445 -0.112267; 0.022822 0.00156835 … -0.112267 0.122992])

In [54]:
altlogl3

-466.79524084075194

In [55]:
# alt model mean effects
altmodel3.B

5×1 Array{Float64,2}:
 40.0792  
 -6.50488 
  1.13837 
  0.311402
 -0.121256

In [56]:
# alt model additive variance
altmodel3.Σ[1]

1×1 Array{Float64,2}:
 3.37194

In [57]:
# alt model environmental variance
altmodel3.Σ[2]

1×1 Array{Float64,2}:
 2.21917

Test whether the interaction term improves the model fit over the additive effects of the two SNPs alone.

In [58]:
using Distributions
LRT3 = 2(altlogl3 - altlogl2)

0.1189467399027535

In [59]:
#change the degrees of freedom if running a bivariate outcome
pval_snp_interact = ccdf(Chisq(1), LRT3)

0.7301796542888648

We see that adding the interaction effect as an additional covariate does not significantly improve the model fit for Trait1 (p ≈ 0.73).
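For reference, the test used here can be written out as follows. This is the standard likelihood ratio formulation; the symbols $\ell_{\text{alt}}$, $\ell_{\text{null}}$, and $k$ are notation introduced here, with $k$ the number of extra parameters in the larger model:

$$
\Lambda = 2\,(\ell_{\text{alt}} - \ell_{\text{null}}), \qquad p = \Pr\left(\chi^2_k \ge \Lambda\right)
$$

For the interaction test, $k = 1$ and $\Lambda \approx 0.119$, which gives $p \approx 0.73$ as computed above.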

Thus, we report that:
    (1) The SNP rs10412915 displays suggestive association, but it is not genomewide significant.
    (2) Adding the second SNP, rs1036231, to the model with just rs10412915 does not improve the model fit.
    (3) Adding the interaction term does not improve the model fit over the effects of the two SNPs alone.

Residual heritability: the proportion of the trait variance remaining after including the SNPs and their interaction in the model that is attributable to additive genetic effects.

In [60]:
# ignore if running a bivariate outcome
her_alt = altmodel3.Σ[1]/(altmodel3.Σ[1] + altmodel3.Σ[2])

1×1 Array{Float64,2}:
 0.60309
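As a quick check of the arithmetic, the ratio can be reproduced from the two printed variance components. This is only a sketch: `heritability` is a helper name introduced here (it is not part of VarianceComponentModels.jl), and the inputs are the rounded univariate estimates displayed above.

```julia
# Narrow-sense heritability from scalar variance components:
# additive genetic variance divided by total (additive + environmental) variance.
heritability(var_additive, var_environment) = var_additive / (var_additive + var_environment)

# Rounded estimates from the interaction model (altmodel3.Σ[1] and altmodel3.Σ[2]).
h2_resid = heritability(3.37194, 2.21917)
println(h2_resid)
```

The notebook's version divides 1×1 matrices, which is why that cell should be skipped for a bivariate outcome; working with scalars makes the univariate assumption explicit.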

The proportion of the additive genetic variance explained by the SNPs in the model is one measure of their effect on a single trait. Note that in this simulated example the effect is very large.

In [61]:
add_proport = (nullmodel.Σ[1] - altmodel3.Σ[1])/nullmodel.Σ[1]

1×1 Array{Float64,2}:
 0.171368

The proportion of total trait variation explained by the SNPs is an alternative way to assess their effect. Again, effects are typically not nearly so large.

In [62]:
pheno_proport = (nullmodel.Σ[1] + nullmodel.Σ[2] - altmodel3.Σ[1] - altmodel3.Σ[2])/(nullmodel.Σ[1] + nullmodel.Σ[2])

1×1 Array{Float64,2}:
 0.101151
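Both proportions above have the same form: the relative drop in a variance estimate when moving from the null model to the alternative model. A minimal sketch, assuming a helper `proportion_explained` (a name introduced here) and purely hypothetical variance values chosen for easy arithmetic; they are not estimates from this data set:

```julia
# Relative reduction in a variance estimate between two nested fits.
proportion_explained(var_null, var_alt) = (var_null - var_alt) / var_null

# Hypothetical values for illustration only (not from SNP_29C).
var_a_null, var_e_null = 4.0, 2.0   # null model: additive, environmental
var_a_alt,  var_e_alt  = 3.0, 2.0   # alternative model: additive, environmental

# Share of the additive genetic variance accounted for by the added covariates.
add_demo = proportion_explained(var_a_null, var_a_alt)

# Share of the total trait variance accounted for by the added covariates.
pheno_demo = proportion_explained(var_a_null + var_e_null, var_a_alt + var_e_alt)
```

With these inputs the additive proportion is 0.25 and the total proportion is 1/6. The second is smaller because the same absolute reduction is divided by a larger denominator, which is also the pattern seen in the two outputs above.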

# Pairwise Trait Analysis

When we analyzed Trait1 alone, the SNP rs10412915 was not genomewide significant. However, the SNP_29C data contain two traits, so we now test whether rs10412915 is genomewide significant when both traits are analyzed jointly. As in the univariate trait analysis above, we use a likelihood ratio test to assess significance.

The following code snippet performs the joint analysis of n_traits = 2 traits. Note that if the data set has more than two traits, the user can set n_traits to the number of traits in the data set and all pairwise analyses will be conducted.

In [63]:
n_traits = 2

2
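The nested loops used in the cells that follow visit each unordered pair of traits exactly once (i < j), so n traits give n(n-1)/2 bivariate fits. A small sketch of that enumeration for a hypothetical three-trait data set; `n_traits_demo` and `trait_pairs` are names introduced here:

```julia
# Hypothetical number of traits, for illustration only.
n_traits_demo = 3

# Same iteration pattern as the analysis loops: j always starts above i.
trait_pairs = [(i, j) for i in 1:n_traits_demo for j in (i + 1):n_traits_demo]

println(trait_pairs)
println(length(trait_pairs) == div(n_traits_demo * (n_traits_demo - 1), 2))
```

With n_traits = 2, as in this notebook, the only pair visited is (1, 2).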

In [64]:
# form data as VarianceComponentVariate
SNP_29Cdata_emp_null = VarianceComponentVariate(Y, X, (2Φgrm, eye(size(Y, 1))))
#fieldnames(SNP_29Cdata_emp)
SNP_29Cdata_rotated_emp_null = TwoVarCompVariateRotate(SNP_29Cdata_emp_null)

VarianceComponentModels.TwoVarCompVariateRotate{Float64,Array{Float64,2},Array{Float64,2}}([-551.252 -268.956; -0.358034 -0.782546; … ; -1.34343 -4.87381; 0.75091 4.59537], [-14.5602 -6.66199; -6.6662e-14 0.197432; … ; 7.23163e-15 -0.0797677; 6.91634e-15 0.142873], [5.78835e-15, 0.0752044, 0.0877463, 0.0915643, 0.0967969, 0.104256, 0.108283, 0.109661, 0.110979, 0.114564  …  3.3852, 3.71687, 3.9086, 4.11158, 4.26956, 4.56256, 5.13676, 5.492, 6.11812, 6.87637], [-0.0686803 0.0119184 … 0.00207223 0.00503436; -0.0686803 0.00929458 … 0.0183992 -0.00115174; … ; -0.0686803 0.00239184 … -0.0203006 0.0363221; -0.0686803 0.000513356 … -0.0033856 0.0127975], 0.0)

In [65]:
# additive genetic effects (2x2 psd matrices) from bivariate trait analysis;
Σa = Array{Matrix{Float64}}(n_traits, n_traits)
# environmental effects (2x2 psd matrices) from bivariate trait analysis;
Σe = Array{Matrix{Float64}}(n_traits, n_traits)

println("Trait1, Trait2 Null Model")
# form data set for (trait1, trait2)

tic()
for i in 1:n_traits
    for j in (i+1):n_traits
        traitij_data = TwoVarCompVariateRotate(SNP_29Cdata_rotated_emp_null.Yrot[:, [i;j]],
            SNP_29Cdata_rotated_emp_null.Xrot, SNP_29Cdata_rotated_emp_null.eigval,
            SNP_29Cdata_rotated_emp_null.eigvec, SNP_29Cdata_rotated_emp_null.logdetV2)
        # initialize model parameters
        traitij_model = VarianceComponentModel(traitij_data)
        # estimate variance components
        maxlogl, _, Σse, Σcov, Bse, Bcov = mle_fs!(traitij_model, traitij_data; solver=:Ipopt, verbose=true)
        Σa[i, j] = traitij_model.Σ[1]
        Σe[i, j] = traitij_model.Σ[2]
        @show Σa[i, j], Σe[i, j]
        @show traitij_model.B
        @show maxlogl
    end
end
toc()

Trait1, Trait2 Null Model
This is Ipopt version 3.12.8, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:        0
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:       21

Total number of variables............................:        6
                     variables with only lower bounds:        0
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) al

3.391869234

The maximum log-likelihood of the null model is `maxlogl = -949.6093318451558`.

In [66]:
# form data as VarianceComponentVariate
SNP_29Cdata_emp = VarianceComponentVariate(Y, Xalt, (2Φgrm, eye(size(Y, 1))))
#fieldnames(SNP_29Cdata_emp)
SNP_29Cdata_rotated_emp = TwoVarCompVariateRotate(SNP_29Cdata_emp)

VarianceComponentModels.TwoVarCompVariateRotate{Float64,Array{Float64,2},Array{Float64,2}}([-551.252 -268.956; -0.358034 -0.782546; … ; -1.34343 -4.87381; 0.75091 4.59537], [-14.5602 -6.66199 -8.51635; -6.6662e-14 0.197432 0.312084; … ; 7.23163e-15 -0.0797677 0.332059; 6.91634e-15 0.142873 0.302486], [5.78835e-15, 0.0752044, 0.0877463, 0.0915643, 0.0967969, 0.104256, 0.108283, 0.109661, 0.110979, 0.114564  …  3.3852, 3.71687, 3.9086, 4.11158, 4.26956, 4.56256, 5.13676, 5.492, 6.11812, 6.87637], [-0.0686803 0.0119184 … 0.00207223 0.00503436; -0.0686803 0.00929458 … 0.0183992 -0.00115174; … ; -0.0686803 0.00239184 … -0.0203006 0.0363221; -0.0686803 0.000513356 … -0.0033856 0.0127975], 0.0)

In [67]:
# additive genetic effects (2x2 psd matrices) from bivariate trait analysis;
Σa = Array{Matrix{Float64}}(n_traits, n_traits)
# environmental effects (2x2 psd matrices) from bivariate trait analysis;
Σe = Array{Matrix{Float64}}(n_traits, n_traits)

println("Trait1, Trait2 Alternative Model")
# form data set for (trait1, trait2)

tic()
for i in 1:n_traits
    for j in (i+1):n_traits
        traitij_data = TwoVarCompVariateRotate(SNP_29Cdata_rotated_emp.Yrot[:, [i;j]],
            SNP_29Cdata_rotated_emp.Xrot, SNP_29Cdata_rotated_emp.eigval,
            SNP_29Cdata_rotated_emp.eigvec, SNP_29Cdata_rotated_emp.logdetV2)
        # initialize model parameters
        traitij_model = VarianceComponentModel(traitij_data)
        # estimate variance components
        maxlogl, _, Σse, Σcov, Bse, Bcov = mle_fs!(traitij_model, traitij_data; solver=:Ipopt, verbose=true)
        Σa[i, j] = traitij_model.Σ[1]
        Σe[i, j] = traitij_model.Σ[2]
        @show Σa[i, j], Σe[i, j]
        @show traitij_model.B
        @show maxlogl
    end
end
toc()

Trait1, Trait2 Alternative Model
This is Ipopt version 3.12.8, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:        0
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:       21

Total number of variables............................:        6
                     variables with only lower bounds:        0
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg

0.190632486

Notice that the maximum log-likelihood of the alternative model is higher than that of the null model: `maxlogl = -930.8141679806174`.

### Bivariate Trait Likelihood Ratio Test

We can now perform a likelihood ratio test to check whether the major locus rs10412915 is genomewide significant in the bivariate trait analysis.

In [68]:
LRT4 = 2(-930.8141679806174 - (-949.6093318451551))

37.5903277290754

Recall that when performing a likelihood ratio test for a bivariate outcome, the degrees of freedom change from 1 to 2, because the SNP now has one effect parameter for each trait.

In [69]:
pval_bivariate_trait = ccdf(Chisq(2), LRT4)

6.876446162965101e-9

Now we see that the association is genomewide significant: the p-value, about 6.9 × 10⁻⁹, falls below the conventional genomewide threshold of 5 × 10⁻⁸.
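With 2 degrees of freedom the chi-square tail probability has a simple closed form, exp(-x/2), because a chi-square distribution with 2 degrees of freedom is an exponential distribution with mean 2. The sketch below uses that identity to cross-check the p-value without Distributions.jl; the variable names are introduced here, and the log-likelihoods are the values entered in the LRT4 cell above.

```julia
# Maximized log-likelihoods as entered in the LRT4 cell above.
logl_null = -949.6093318451551   # intercept and sex only
logl_alt  = -930.8141679806174   # intercept, sex, and the SNP

# Likelihood ratio statistic for the bivariate model.
lrt_stat = 2 * (logl_alt - logl_null)

# Upper tail of a chi-square with 2 degrees of freedom: P(X >= x) = exp(-x / 2).
pval_closed_form = exp(-lrt_stat / 2)

println(lrt_stat)
println(pval_closed_form)
```

This shortcut applies only when the test has exactly 2 degrees of freedom; for other cases `ccdf(Chisq(df), stat)` remains the general route.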