# GWAS Analysis using Linear Mixed Effect Models

In this tutorial we show you how to use VarianceComponentModels.jl from the Open Mendel project to run a standard GWAS that accounts for relatedness or population substructure by modeling it as a random effect. The data used in this tutorial are from Mendel version 16.0, option 29 (http://software.genetics.ucla.edu/download?package=1). They are simulated data and are freely available, but please acknowledge the Open Mendel project if you use them. Strictly speaking this is not a genome-wide analysis, because the data contain only SNPs on chromosome 19, but with ~140,000 SNPs you will get an idea of the capabilities of Open Mendel.

To use this tutorial you will need to have installed SnpArrays, VarianceComponentModels and MendelPlots from the Open Mendel project. To do so, open Julia in a terminal and use the Julia package manager. Add SnpArrays first, then VarianceComponentModels and finally MendelPlots.

] #invokes the package manager

add https://github.com/OpenMendel/SnpArrays.jl.git

add https://github.com/OpenMendel/VarianceComponentModels.jl.git

add https://github.com/OpenMendel/MendelPlots.jl.git

You will also need the packages DataFrames, CSV, Distributions, DelimitedFiles, and LinearAlgebra. DelimitedFiles and LinearAlgebra ship with Julia as standard libraries; the others are registered packages. You can add them using the package manager if you haven't already:

add Distributions.jl 

add DelimitedFiles.jl  

add LinearAlgebra.jl

add CSV.jl

add DataFrames.jl


The tutorial has been tested with Julia 1.1.0.


In [None]:
versioninfo()

### Load Required Packages  

In [None]:
# packages from OpenMendel
using SnpArrays, VarianceComponentModels
# registered package (Distributions) and Julia standard libraries
using Distributions, DelimitedFiles, LinearAlgebra

### Read in the family structure and the trait

In this example we will use one of the two simulated traits found in the fam file. We will also use sex as a covariate. In the fam file sex is denoted as F or M. We arbitrarily choose M (male) to be the reference group, so M is coded as 0 and F (female) as 1 when we define the sex variable. The estimated effect of sex is then the change in mean trait value from male to female.

In [None]:
pedLMM = readdlm("SNP_29a.fam", ','; header = false)
Trait1 = convert(Vector{Float64}, pedLMM[:, 7])
# Trait2 = convert(Vector{Float64}, pedLMM[:, 8])
# Y = [Trait1 Trait2]
sex = map(x -> strip(x) == "F" ? 1.0 : 0.0,  pedLMM[:, 5])

We can check that the data were read in correctly by typing the name of the variable.

In [None]:
Trait1

### Read in genotypes and calculate GRM

We use SnpArrays to read in the binary SNP file. We also use SnpArrays to calculate the genetic relationship matrix (GRM). In this example we exclude any SNPs with a minor allele frequency (MAF) less than 0.05. Using SNPs with MAF > 0.05 helps ensure that the GRM is accurate, because rare variants can bias the GRM.

In [None]:
snpbinLMM = SnpArray("SNP_29a.bed")
ex29agrm = grm(snpbinLMM; method = :GRM, minmaf=0.05)
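
To make the GRM step concrete, here is a minimal sketch of one common GRM estimator applied to a small, made-up dosage matrix. The data and the names `G`, `keep`, and `grm_toy` are hypothetical and exist only for illustration. SnpArrays' own `grm` works on the compressed genotype data, handles missing genotypes, and may use a different scaling convention, so treat this as an explanation of the idea rather than as the package's implementation.

```julia
using LinearAlgebra, Statistics

# Hypothetical toy data: 4 individuals x 5 SNPs, allele dosages in {0, 1, 2}.
# The fourth SNP is monomorphic and should be filtered out.
G = Float64[0 1 2 0 1;
            1 1 0 0 2;
            2 0 1 0 1;
            1 2 1 0 0]

# frequency of the counted allele, and minor allele frequency, per SNP
freq = vec(mean(G, dims = 1)) ./ 2
mafs = min.(freq, 1 .- freq)

# keep only SNPs above the MAF threshold used in this tutorial
keep = mafs .> 0.05
Gk, pk = G[:, keep], freq[keep]

# center each SNP by 2p, scale by sqrt(2p(1 - p)), then average over SNPs
Z = (Gk .- 2 .* pk') ./ sqrt.(2 .* pk' .* (1 .- pk'))
grm_toy = Z * Z' / size(Z, 2)
```

Dropping low-frequency SNPs before standardizing matters here because the scaling factor `sqrt(2p(1 - p))` approaches zero as `p` does, which inflates the contribution of rare variants.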

We need to know the order of the SNPs in the bed file, so we read in the bim file. We then keep only the SNPs with MAF > 0.05 (that is, we exclude those with MAF ≤ 0.05) to match the set of SNPs we used in the GRM.

In [None]:
# columns are: :chrom, :snpid, :genetic_distance, :pos, :allele1, :allele2
snpLMM = readdlm("SNP_29a.bim"; header = false)
snpLMM = snpLMM[maf(snpbinLMM) .> 0.05,:]
snpid = map(x -> strip(string(x)), snpLMM[:, 2])

### Setting up the data for the covariates
In this case we have only sex as a covariate, but we could have used other covariates as desired. The `ones(n)` call sets up a variable that has value 1.0 for all individuals. This allows for the estimate of the grand mean $\mu$.

In [None]:
n, snps = size(snpbinLMM[:,maf(snpbinLMM) .> 0.05])
X = [ones(n) sex]
p = size(X,2)  # no. columns in X (intercept + covariates)
n, snps, p
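
As a small illustration of what this coding implies, the sketch below fits ordinary least squares to made-up data with the same design matrix layout. It deliberately ignores the random effect, and the names `sex_toy`, `y_toy`, `X_toy`, and `β_toy` are hypothetical. With males coded 0 and females coded 1, the intercept estimates the male mean and the sex coefficient estimates the female minus male difference.

```julia
using LinearAlgebra, Statistics

# Hypothetical toy data: two males (0.0) followed by two females (1.0)
sex_toy = [0.0, 0.0, 1.0, 1.0]
y_toy   = [1.0, 3.0, 4.0, 6.0]

# same layout as X in the tutorial: a column of ones, then the covariate
X_toy = [ones(length(sex_toy)) sex_toy]

# ordinary least squares fit (no random effect in this sketch)
β_toy = X_toy \ y_toy

# group means, for comparison with the fitted coefficients
male_mean   = mean(y_toy[sex_toy .== 0.0])
female_mean = mean(y_toy[sex_toy .== 1.0])
```

Here `β_toy[1]` matches `male_mean` and `β_toy[2]` matches `female_mean - male_mean`. In the mixed model the estimates are generalized least squares estimates, but the interpretation of the coefficients is the same.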

### Prepare to fit LmmGWAS

First we analyze the data under the null model of no snp effects. The next three commands set up the data.  Then we need to decide which algorithm we wish to use to get our estimates. We have chosen the MM algorithm. Alternatively we could have used Fisher scoring (FS). The next set of commands then implements the optimization.  

In [None]:
# fit null model once to store necessary information for alternative model 
nulldata    = VarianceComponentVariate(Trait1, X, (2ex29agrm, Matrix{Float64}(I, n, n)))
nulldatarot = TwoVarCompVariateRotate(nulldata)
nullmodel   = VarianceComponentModel(nulldata)
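
As we read the code above, the null model being set up can be written as follows. The symbols $\boldsymbol{\Phi}$ (for `ex29agrm`), $\boldsymbol{\beta}$, $\sigma^2_a$ and $\sigma^2_e$ are our notation and are not taken from the package documentation:

$$
\mathbf{y} \sim N\left(\mathbf{X}\boldsymbol{\beta},\; \sigma^2_a \,(2\boldsymbol{\Phi}) + \sigma^2_e \,\mathbf{I}_n\right)
$$

Here $\mathbf{y}$ is `Trait1`, $\mathbf{X}$ is the $n \times p$ matrix holding the intercept and sex, $2\boldsymbol{\Phi}$ is the first covariance matrix passed to `VarianceComponentVariate`, and $\mathbf{I}_n$ is the identity matrix passed as the second.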

In [None]:
algorithm = :MM

In [None]:
if algorithm == :MM
    logl_null,_,_,Σcov, = mle_mm!(nullmodel, nulldatarot; verbose = true)
elseif algorithm == :FS
    logl_null,_,_,Σcov, = mle_fs!(nullmodel, nulldatarot; verbose = true)
end

### Heritability of `Trait1`

We now calculate the narrow sense heritability and its standard error. The equation for heritability ($h$) is $h = \sigma^2_a / (\sigma^2_a + \sigma^2_e)$. Note that in this version of VarianceComponentModels.jl we allow for only two variance components, the additive genetic variance and the independent environmental variance. In future implementations we will allow for more variance components.

In [None]:
h, hse = heritability(nullmodel.Σ, Σcov)
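
One standard way to get a standard error for a ratio like this is the delta method. The sketch below shows that calculation on made-up numbers. The function name `heritability_delta` and its inputs are hypothetical, and we are assuming, not asserting, that the package's `heritability` function works along similar lines.

```julia
using LinearAlgebra

# Delta-method standard error for h = σ2a / (σ2a + σ2e).
# C is the 2x2 covariance matrix of the estimates (σ2a, σ2e).
function heritability_delta(σ2a, σ2e, C)
    total = σ2a + σ2e
    h = σ2a / total
    # gradient of h with respect to (σ2a, σ2e)
    grad = [σ2e, -σ2a] ./ total^2
    hse = sqrt(dot(grad, C * grad))
    return h, hse
end

# Made-up variance components and covariance matrix, for illustration only
h_toy, hse_toy = heritability_delta(3.0, 1.0, [0.04 -0.01; -0.01 0.02])
```

The gradient has opposite signs for the two components, so a negative covariance between the estimates increases the standard error of the ratio.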

The heritability of this simulated trait is rather on the high side ($72\%$) for a human trait, which explains why we can get away with only 212 individuals in this GWAS. 

## GWAS of Trait1

We now prepare our alternative models in order to conduct our GWAS of Trait1:

In [None]:
## fit alternative model with SNPs, push null model info to alternative model 
T = eltype(sex)
altdatarot = TwoVarCompVariateRotate(nulldatarot.Yrot,
    zeros(T, n, size(X, 2) + 1), nulldatarot.eigval, nulldatarot.eigvec,
    nulldatarot.logdetV2)
copyto!(altdatarot.Xrot, nulldatarot.Xrot) # last column remains zero
altmodel = VarianceComponentModel(altdatarot)
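
The reason the rotated data can be reused for every SNP is that, with two variance components, a single eigendecomposition diagonalizes the covariance matrix for any values of the variance components. The sketch below demonstrates this on a made-up 3 x 3 relationship matrix. The names `K`, `U`, `d`, `σ2a`, and `σ2e` are hypothetical, and the sketch assumes the second covariance matrix is the identity, as it is in this tutorial; the package handles the general case.

```julia
using LinearAlgebra

# Hypothetical 3x3 relationship matrix and variance components
K = [1.0 0.5 0.0;
     0.5 1.0 0.25;
     0.0 0.25 1.0]
σ2a, σ2e = 2.0, 1.0

# eigendecomposition K = U * Diagonal(d) * U'
F = eigen(Symmetric(K))
U, d = F.vectors, F.values

# covariance of the trait and its rotated version
V    = σ2a * K + σ2e * I
Vrot = U' * V * U

# rotating a trait vector (or a genotype vector) uses the same U
y_toy    = [0.3, -1.2, 0.8]
y_rotated = U' * y_toy
```

After rotation the covariance is `Diagonal(σ2a .* d .+ σ2e)`, so the likelihood reduces to independent terms. This is why the loop only needs to multiply each new genotype vector by the transposed eigenvectors.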

### Loop over all SNPs to calculate LRT pvalues for LmmGWAS

The following routine shows how you can write some simple Julia code to execute a GWAS. This tutorial is set up to run all the SNPs. Note that if you are running this tutorial on an older laptop, you should be prepared to wait a while for this step to finish. If you see the counter (printed once every 1000 SNPs processed) progressing, then the program is working, so just be patient. Alternatively, you might wish to try out the tutorial on a much smaller example by looping through only the first few SNPs, for example the first 100. To do so, comment out `@time for snp in 1:snps` with a `#` and remove the `#` from `@time for snp in 1:testrun`.

In [None]:
pvalue   = ones(snps)
genovec  = zeros(T, n)
testrun  = 100

snpsidx = findall(maf(snpbinLMM) .> 0.05)
#@time for snp in 1:testrun
@time for snp in 1:snps 
    # append (rotated) genotype vector to covariate matrix
    Base.copyto!(genovec, @view(snpbinLMM[:,snpsidx[snp]]), model=ADDITIVE_MODEL, center=true, scale=true, impute=true)
    tmp_mat = similar(genovec)
    LinearAlgebra.mul!(tmp_mat, transpose(altdatarot.eigvec), genovec)
    altdatarot.Xrot[:, end] = tmp_mat
    # initialize mean effects to null model fit
    fill!(altmodel.B, zero(T))
    copyto!(altmodel.B, nullmodel.B)
    copyto!(altmodel.Σ[1], nullmodel.Σ[1])
    copyto!(altmodel.Σ[2], nullmodel.Σ[2])
    # fit alternative model
    if algorithm == :MM
        logl_alt, vcmodel_mle, Σse, Σcov, Bse, Bcov = mle_mm!(altmodel, altdatarot; verbose = false)
    elseif algorithm == :FS
        logl_alt, = mle_fs!(altmodel, altdatarot; verbose = false)
    end
    # LRT statistic and its p-value
    lrt = -2(logl_null - logl_alt)
    pvalue[snp] = ccdf(Chisq(1), lrt)
    # print a progress counter once every 1000 SNPs
    if mod(snp, 1000) == 1
        println(snp)
    end
    # optional per-SNP output:
    # println(snp, ": ", snpid[snp], "\n\tLRT p: ", pvalue[snp])
end
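
The test computed for each SNP in the loop above is a likelihood ratio test with one degree of freedom. Writing $\ell_0$ for `logl_null` and $\ell_1^{(j)}$ for `logl_alt` at SNP $j$ (our notation), the statistic and p-value are:

$$
\mathrm{LRT}_j = -2\left(\ell_0 - \ell_1^{(j)}\right), \qquad p_j = P\left(\chi^2_1 \ge \mathrm{LRT}_j\right)
$$

The single degree of freedom corresponds to the one extra mean parameter, the SNP effect, in the alternative model.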


### Output results to file
In some situations you may want to save GWAS results for future use, for example as part of a meta-analysis. In the next set of commands, we show you how to make and save a comma-delimited file with the SNP id, the base pair position of the SNP, the chromosome of the SNP, the minor allele frequency, and the p-value.

In [None]:
using DataFrames
using CSV

In [None]:
# maf, = summarize(snpbinLMM)
plot_frame = DataFrame(snpid = snpLMM[:,2],
   AdjBasepairs = snpLMM[:,4], 
   Chromosome = snpLMM[:,1], 
   MAF = maf(snpbinLMM)[maf(snpbinLMM).>0.05],
   Pvalue = pvalue)

CSV.write("lmmGWAS_output_pVal.txt", plot_frame)

### Manhattan Plot

One of the most common ways to display the results of a GWAS is as a plot of negative log base 10 of the p-values versus chromosomal position, a Manhattan plot. For your convenience, we have developed a Julia plotting module as part of the Open Mendel project, MendelPlots. We demonstrate its use below. The Manhattan plot will look a little different than the typical one because this example only includes markers from chromosome 19.

In [None]:
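# CSV.File piped into DataFrame also works on CSV.jl versions where CSV.read requires a sink argument
plot_frame = CSV.File("lmmGWAS_output_pVal.txt") |> DataFrame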

In [None]:
using MendelPlots

manhattan(plot_frame; pvalvar = "Pvalue", chrvar = "Chromosome", 
    posvar = "AdjBasepairs", outfile = "lmmGWAS_manhattan.png", fontsize = 18pt, linecolor = "red")


In [None]:
 display("image/png", read("lmmGWAS_manhattan.png")) 
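
For readers who want to see the y-axis transformation on its own, the short sketch below applies the negative log base 10 transform to a few made-up p-values. The names `pvals_toy` and `neglog10p` are hypothetical, and this is not a description of how MendelPlots computes its axis internally.

```julia
# Made-up p-values, for illustration only
pvals_toy = [0.05, 1e-3, 5e-8]

# smaller p-values map to larger values on the Manhattan plot's y-axis
neglog10p = -log10.(pvals_toy)
```

Because the transform is monotone decreasing, the most significant SNPs appear as the tallest points in the plot.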

## Conclusions

This tutorial demonstrates how with just a little extra Julia coding, an Open Mendel user can use the VarianceComponentModels module to conduct a GWAS that takes into account possible relatedness or population substructure among individuals as a random effect.  