# Association Testing Using Variance Component Model

In this notebook, we read in the Mendel Option 29 (Ped-GWAS) data and demonstrate an association study using our most associated SNP and trait 2. As a consequence of the analysis, we can also estimate the heritability. If you want to test your Julia skills, try changing the code to (1) run another SNP (or a set of SNPs); (2) run the same analysis with trait 1; or (3) run the bivariate analysis (both traits). Note that for the bivariate analysis the concept of heritability is not well defined, so omit that part.



Acknowledgement: Hua Zhou wrote the vast majority of this notebook, with a little tweaking by Janet Sinsheimer.

## Data files

We start from the following 3 files from the [Mendel Option 29 (Ped-GWAS) example](https://www.genetics.ucla.edu/software/Mendel_current_doc.pdf#page=294). The following shell commands assume a macOS or Linux environment. The Julia commands should run regardless of OS.

In [None]:
;ls -l Ped29c.in SNP_data29a.bin SNP_def29a.in

Because the SnpArray function requires an input file name ending in .bed rather than .bin, we create a symbolic link SNP_data29a.bed to SNP_data29a.bin. (If you have trouble getting this command to work on your computer, you can copy the file outside of Julia.)


In [None]:
;ln -s ./SNP_data29a.bin ./SNP_data29a.bed
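
If the `ln -s` shell command is not available (for example on Windows), a plain file copy made from within Julia should serve the same purpose. The following is a minimal sketch using only Base functions; `ensure_bed` is a hypothetical helper name introduced here, not part of any package.

```julia
# Hypothetical helper: make `dst` available as a copy of `src`
# unless something already exists at `dst`.
function ensure_bed(src::AbstractString, dst::AbstractString)
    if !ispath(dst)
        cp(src, dst)   # plain file copy, independent of the OS shell
    end
    return dst
end

# Usage for this notebook (assumes the file is in the working directory):
# ensure_bed("SNP_data29a.bin", "SNP_data29a.bed")
```

A copy uses extra disk space compared with a symbolic link, but it avoids any dependence on shell commands.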

In [None]:
;ls

## Read in Mendel Option 29 data

Take a look at the first 10 lines of the pedigree file.

In [None]:
;head Ped29c.in

Read in the pedigree file. This file is in the classic Mendel format: family ID, person ID, father ID, mother ID, sex as F (female) or M (male), monozygotic twin indicator, simtrait1, and simtrait2.

In [None]:
# columns are: :famid, :id, :moid, :faid, :sex, :twin, :simtrait1, :simtrait2, :group
ped29c = readcsv("Ped29c.in", Any; header = false)

We don't need to retain the IDs, so we retrieve the two phenotypes and put them in an array Y.

In [None]:
simtrait1 = convert(Vector{Float64}, ped29c[:, 7])
simtrait2 = convert(Vector{Float64}, ped29c[:, 8])
Y = [simtrait1 simtrait2]

Retrieve sex data coded as 0 (male) or 1 (female) so male is the reference group.

In [None]:
sex = map(x -> strip(x) == "F" ? 1.0 : 0.0, ped29c[:, 5])

Take a look at the first 10 lines of the SNP definition file.

In [None]:
;head SNP_def29a.in

Read in the SNP definition file, skipping the first 2 lines.

In [None]:
# columns are: :snpid, :chrom, :pos, :allele1, :allele2, :groupname
snpdef29c = readcsv("SNP_def29a.in", Any; skipstart = 2, header = false)

We will be analyzing SNPs one at a time, so we don't need the SNP positions, just the SNP IDs.

In [None]:
snpid = map(x -> strip(string(x)), snpdef29c[:, 1])

Read in the SNP binary file using the SnpArrays.jl package.

In [None]:
using SnpArrays

snpbin29a = SnpArray("SNP_data29a"; people = size(ped29c, 1), snps = size(snpdef29c, 1))

## Kinship via Genetic Relationship Matrix (GRM)

Recall that in using variance components (linear mixed models) we need a measure of the relatedness among individuals. Under the GRM formulation, the estimate of the global kinship coefficient of individuals $i$ and $j$ is
$$ \widehat\Phi_{GRMij} = \frac{1}{2S} \sum_{k=1}^S \frac{(x_{ik} - 2p_k)(x_{jk} - 2p_k)}{2 p_k (1-p_k)}, $$
where $k$ ranges over the selected $S$ SNPs, $p_k$ is the minor allele frequency of SNP $k$, and $x_{ik}$ is the number of minor alleles in individual $i$'s genotype at SNP $k$.
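
To make the formula concrete, here is a minimal sketch in base Julia that evaluates it on a tiny made-up genotype matrix. The function name `grm_sketch` and the toy data are assumptions for illustration. The sketch estimates $p_k$ from the sample, assumes no missing genotypes and no monomorphic SNPs, and applies no allele frequency filter, so it is not a substitute for the package's `grm` function.

```julia
# Toy genotype matrix: rows are individuals, columns are SNPs,
# entries are minor-allele counts (0, 1, or 2). Values are made up.
G = [0.0 1.0 2.0;
     1.0 1.0 0.0;
     2.0 0.0 1.0;
     1.0 2.0 1.0]

# Direct evaluation of the GRM formula, one SNP at a time.
function grm_sketch(G::Matrix{Float64})
    n, S = size(G)
    Φ = zeros(n, n)
    for k in 1:S
        p = sum(G[:, k]) / (2n)      # sample allele frequency of SNP k
        w = 2p * (1 - p)             # scaling term 2 p_k (1 - p_k)
        for j in 1:n, i in 1:n
            Φ[i, j] += (G[i, k] - 2p) * (G[j, k] - 2p) / w
        end
    end
    return Φ / (2S)                  # average over SNPs, with the factor 1/2
end

Φtoy = grm_sketch(G)
```

Because each SNP column is centered at twice its sample allele frequency, every row of `Φtoy` sums to zero (up to rounding), which is a quick sanity check on the centering.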

## Calculate the GRM matrix

By default, `grm` excludes SNPs with minor allele frequency (MAF) < 0.01.

In [None]:
Φgrm = grm(snpbin29a; method = :GRM)

## Fit the null variance component model

Recall that we are using a variance component model with simtrait2 as the outcome. Under the null hypothesis, the SNP has no effect, so sex is the only covariate included as a fixed effect on simtrait2. We also need to account for the relatedness among individuals. To do that, we include a random effect and use the GRM matrix to describe the covariance structure.
$$ Y_{2i} = \mu + \beta_{sex} sex_i + A_i + e_i $$
$$ A_i \sim N(0,\sigma^2_a), \quad e_i \sim N(0,\sigma^2_e) $$
$$ Cov(Y_{2i},Y_{2j}) = 2\Phi_{ij} \sigma^2_a + 1_{i = j}\sigma^2_e $$
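
To illustrate the covariance structure, the following minimal sketch in base Julia builds the implied covariance matrix for two individuals. The kinship values and variance components are made-up numbers for illustration, not estimates from these data, and the names `Φpair`, `σ2a`, `σ2e`, and `Vpair` are introduced here only for the example.

```julia
# Hypothetical kinship matrix for two non-inbred full siblings:
# self-kinship 1/2 on the diagonal, kinship 1/4 off the diagonal.
Φpair = [0.5 0.25;
         0.25 0.5]

σ2a = 4.0   # made-up additive genetic variance
σ2e = 1.0   # made-up environmental variance

n = size(Φpair, 1)
# Identity matrix written out explicitly so that the sketch needs no packages
Ipair = [i == j ? 1.0 : 0.0 for i in 1:n, j in 1:n]

# Cov(Y_i, Y_j) = 2 Φ_ij σ2a + 1{i = j} σ2e
Vpair = 2 * σ2a * Φpair + σ2e * Ipair
```

With these numbers each individual has total variance 5, and the covariance between the siblings is 2; only the additive genetic component contributes to the off-diagonal entries.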

In [None]:
using VarianceComponentModels

# form data as VarianceComponentVariate
X = [ones(length(simtrait1)) sex]
#change this next command if you want to run trait 1 or both traits (Y)
nulldata = VarianceComponentVariate(Y[:,2], X, (2Φgrm, eye(length(simtrait2))))

When we run the alternative model, it can be helpful to start from our best estimates from the null model. Initialize the variance component model parameters.

In [None]:
nullmodel = VarianceComponentModel(nulldata)

In [None]:
@time nulllogl, nullmodel, = fit_mle!(nullmodel, nulldata; algo = :FS)

In [None]:
# null model log-likelihood
nulllogl

In [None]:
# null model mean effects
nullmodel.B

In [None]:
# null model additive genetic variance
nullmodel.Σ[1]

In [None]:
# null model environmental variance
nullmodel.Σ[2]

### Heritability 
Calculate the proportion of the variance that can be attributed to additive genetic effects, the narrow sense heritability.  
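
In terms of the variance components of the model, this ratio can be written as follows, where $\hat\sigma^2_a$ and $\hat\sigma^2_e$ denote the maximum likelihood estimates of the additive genetic and environmental variances (the symbol $\hat h^2$ is introduced here for illustration):

$$ \hat h^2 = \frac{\hat\sigma^2_a}{\hat\sigma^2_a + \hat\sigma^2_e} $$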

In [None]:
her_null = nullmodel.Σ[1]/(nullmodel.Σ[1]+nullmodel.Σ[2])

## Fit variance component model with the causal SNP

These data were simulated so that a male has a value of 20 and a female has a value of 16. The trait is simulated with a major locus, rs10412915, with an additive effect of 1.5 per minor allele, such that a heterozygote male has a value of 20. There is also strong residual genetic variation.

In [None]:
ind_rs10412915 = find(x -> x == "rs10412915", snpid)[1]
# You can change this SNP if you would like to assess another SNP's effect on the trait, e.g.:
#ind_rs56343121 = find(x -> x == "rs56343121", snpid)[1]

In [None]:
snp_rs10412915 = convert(Vector{Float64}, snpbin29a[:, ind_rs10412915])
#snp_rs56343121 = convert(Vector{Float64}, snpbin29a[:, ind_rs56343121])

In [None]:
# form data as VarianceComponentVariate
Xalt = [ones(length(simtrait2)) sex snp_rs10412915]
#Xalt = [ones(length(simtrait1)) sex snp_rs56343121]
altdata = VarianceComponentVariate(Y[:,2], Xalt, (2Φgrm, eye(length(simtrait2))))

In [None]:
altmodel = VarianceComponentModel(altdata)

### Set the starting values for the maximum likelihood estimation
Use the null model estimates as start values for the alternative model.

In [None]:
altmodel.B[1:2, :] = nullmodel.B
altmodel.B

In [None]:
copy!(altmodel.Σ[1], nullmodel.Σ[1])
copy!(altmodel.Σ[2], nullmodel.Σ[2])
altmodel.Σ

In [None]:
@time altlogl, altmodel, = fit_mle!(altmodel, altdata; algo = :FS)

In [None]:
# alt model log-likelihood
altlogl

In [None]:
# alt model mean effects
altmodel.B

In [None]:
# alt model additive genetic variance
altmodel.Σ[1]

In [None]:
# alt model environmental variance
altmodel.Σ[2]

To test the significance of the SNP, we use a likelihood ratio test (LRT).
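
As a reminder of what the statistic measures, write $\ell_0$ and $\ell_1$ for the maximized log-likelihoods of the null and alternative models (symbols introduced here for illustration). The test statistic is

$$ LRT = 2(\ell_1 - \ell_0), $$

which under the null hypothesis is asymptotically $\chi^2$ distributed with degrees of freedom equal to the number of added mean parameters: 1 for a single SNP and a single trait, and 2 for a single SNP in the bivariate analysis (one SNP effect per trait).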

In [None]:
using Distributions
LRT=2(altlogl - nulllogl)

In [None]:
#change the degrees of freedom if running a bivariate outcome
pval_rs10412915 = ccdf(Chisq(1), LRT)

Residual heritability: the proportion of the remaining trait variance that is attributable to additive genetic effects after including the SNP in the model. Note that heritability is difficult to describe for a bivariate outcome, so it is usually not provided.

In [None]:
# ignore if running a bivariate outcome
her_alt=altmodel.Σ[1]/(altmodel.Σ[1]+altmodel.Σ[2])

The proportion of the genetic variation explained by the SNP is a measure of the effect of the SNP on a single trait. Omit if running a bivariate trait.

In [None]:
add_proport=(nullmodel.Σ[1]-altmodel.Σ[1])/nullmodel.Σ[1]

The proportion of total variation explained by the SNP is an alternative to the above. Omit if running a bivariate trait.

In [None]:
pheno_proport=(nullmodel.Σ[1]+nullmodel.Σ[2]-altmodel.Σ[1]-altmodel.Σ[2])/(nullmodel.Σ[1]+nullmodel.Σ[2])
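
To see how the two proportions differ, here is a small worked example in base Julia with made-up variance component estimates. These are not results from these data, and all variable names are introduced only for illustration.

```julia
# Hypothetical variance component estimates (made-up numbers)
σ2a_null, σ2e_null = 6.0, 2.0   # null model: additive genetic, environmental
σ2a_alt,  σ2e_alt  = 4.5, 2.0   # model that includes the SNP

# Share of the additive genetic variance accounted for by the SNP
add_prop_toy = (σ2a_null - σ2a_alt) / σ2a_null

# Share of the total trait variance accounted for by the SNP
total_null = σ2a_null + σ2e_null
total_alt  = σ2a_alt + σ2e_alt
pheno_prop_toy = (total_null - total_alt) / total_null
```

Here the SNP accounts for 1.5 units of variance, which is 25% of the additive genetic variance (1.5 / 6) but 18.75% of the total variance (1.5 / 8), so the second measure is smaller whenever the environmental variance is nonzero and unchanged between the models.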