# Mendelian Randomization (MR) using Variance Component Models (VCM)

Authors: Sarah Ji, Jin Zhou, Janet Sinsheimer, Hua Zhou

In this notebook, we use simulated data in order to demonstrate how to conduct Mendelian randomization as a method of causal inference. This demonstration differs from many others you will find in that we use family data.  

We don't go into the assumptions, caveats, extensions, disadvantages or advantages of Mendelian randomization or variance component models (a.k.a. linear mixed models). For these aspects we refer the user to the book "Mendelian Randomization: Methods for Using Genetic Variants in Causal Estimation" by Stephen Burgess and Simon G. Thompson. For the theory behind variance component models, see for example "Mathematical and Statistical Methods for Genetic Analysis" by Kenneth Lange.

The notebook is organized as follows: 

Example 1: This first example provides the user with an example in which the genetic variant is a strong instrumental variable (IV) and there is a direct effect of the exposure on the trait.

Example 2: In this second example there is still a strong IV but any association between the trait and the environmental predictor is due to confounding. 

Example 3: In this third example, we demonstrate that there can be considerable bias if the marker is a weak IV.  

The data for each of these examples were simulated from a subset of the GAW data.  

## Two-Stage Regression Method for Mendelian Randomization using Variance Component Models

There are a number of different approaches for carrying out MR. To estimate the causal effect of a quantitative exposure on a quantitative trait, we use a two-stage regression method. We regress the continuous outcome on a genotype (the IV) and any suspected predictors/confounders, and we separately regress the continuous exposure on the same genotype and the same suspected predictors/confounders. We then estimate the direct effect of the exposure on the trait as the ratio of the two genotype effects (trait effect divided by exposure effect). This method has the advantage that, provided the families are drawn from the same underlying population, we can use different families in the two regressions. In our examples, however, we use the same families.
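In symbols, the two regressions and the ratio estimate can be sketched as follows. The notation is ours and is only meant to mirror the code later in this notebook: $y$ is the trait, $e$ the exposure, $g$ the genotype used as the IV, and $X$ the design matrix of covariates. Both residual vectors are assumed to have covariance of the form $\sigma_a^2 \, 2\Phi + \sigma_e^2 I$, where $\Phi$ is the kinship (GRM) matrix.

$$ y = X\alpha_T + g\,\beta_{TE} + \epsilon_T, \qquad e = X\alpha_E + g\,\beta_{EE} + \epsilon_E $$

$$ \widehat\beta_{DE} = \frac{\widehat\beta_{TE}}{\widehat\beta_{EE}}, \qquad \widehat{SE}(\widehat\beta_{DE}) \approx \sqrt{\frac{SE(\widehat\beta_{TE})^2}{\widehat\beta_{EE}^2} + \frac{\widehat\beta_{TE}^2 \, SE(\widehat\beta_{EE})^2}{\widehat\beta_{EE}^4}} $$

The standard error is a first-order (delta method) approximation that ignores any covariance between the two genotype effect estimates. It is the same expression computed for `SEbeta` in the `MendelianRandomization` function defined in this notebook.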

First we introduce the module MRVC, which calls the variance component model code along with the other modules needed to fit variance component models. The variance component module (VarianceComponentModels.jl) can estimate the fixed effects using either an MM algorithm or Fisher scoring. We use Fisher scoring in the examples below.

In general the optimization algorithm used to maximize the loglikelihood needs to invert the $nd$ by $nd$ overall covariance matrix $\Omega = \Sigma_1 \otimes V_1 + \cdots + \Sigma_m \otimes V_m$ in each iteration. Inverting this matrix is computationally expensive, requiring $O(n^3 d^3)$ floating-point operations. When there are only two variance components ($m=2$), this inversion can be avoided by taking one (generalized) eigendecomposition of $(V_1, V_2)$ and rotating the data $(Y, X)$ by the eigenvectors.
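The rotation idea can be illustrated on a toy problem. The sketch below is not the VarianceComponentModels.jl implementation; it assumes a univariate trait ($d = 1$), a small made-up matrix `V1`, `V2` equal to the identity, and hypothetical variance components `s2g` and `s2e`. It only checks that, after one generalized eigendecomposition, the quadratic form $y^T \Omega^{-1} y$ can be computed from a diagonal matrix.

```julia
using LinearAlgebra

# Toy kinship-like matrix V1 (symmetric positive definite) and V2 = identity.
V1 = [1.0 0.5 0.25; 0.5 1.0 0.5; 0.25 0.5 1.0]
V2 = Matrix{Float64}(I, 3, 3)

# Generalized eigendecomposition: U' * V1 * U = Diagonal(d) and U' * V2 * U = I.
F = eigen(Symmetric(V1), Symmetric(V2))
d, U = F.values, F.vectors

# Hypothetical variance components (assumed values, for illustration only).
s2g, s2e = 2.0, 0.5
Omega = s2g * V1 + s2e * V2

# In the rotated coordinates the covariance is diagonal.
rotated = U' * Omega * U
diag_expected = s2g .* d .+ s2e

# Quadratic form y' * inv(Omega) * y computed two ways.
y = [1.0, -2.0, 0.5]
yrot = U' * y                                   # rotate the data once
quad_fast = sum(abs2.(yrot) ./ diag_expected)   # elementwise division only
quad_slow = dot(y, Omega \ y)                   # dense linear solve

println(quad_fast, " vs ", quad_slow)
```

The eigendecomposition is done once, so each subsequent iteration with new variance component values only needs elementwise operations on `diag_expected`.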

## Preliminaries - the Julia modules used in the analysis:

For our convenience we will create two new Julia functions to use in this analysis. We could have made separate programs for you to call, but by including them in this notebook, you can see a little bit of Julia programming. If you are familiar with R or Matlab you will see some similarities in syntax.

Machine Info:

In [1]:
versioninfo()

Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)


Be sure to make note of your own machine information, as this is a very important check for reproducibility!

## Add OpenMendel Packages

If this is your first time using these two packages, add them with the following commands in the Julia package manager (press `]` at the Julia prompt to enter `pkg>` mode):

`pkg> add https://github.com/OpenMendel/VarianceComponentModels.jl.git`



`pkg> add https://github.com/OpenMendel/SnpArrays.jl.git`

In [2]:
using SnpArrays
using DataFrames
using StatsBase
using VarianceComponentModels
using GLM
using Distributions
using LinearAlgebra

In [3]:
## MendelianRandomization takes as input: trait, design matrix, IV, exposure, GRM, and algorithm (:MM or :FS for Fisher scoring) ##
function MendelianRandomization(Y::AbstractVecOrMat,
    X::AbstractVecOrMat,
    #IV::SnpArrays.SnpArray{2},
    IV::AbstractVecOrMat,
    exposure::AbstractVecOrMat,
    ΦGRM::AbstractMatrix,
    algorithm::Symbol = :FS)

    T = eltype(X)
    n = length(IV)

    ## fit null model without IV effects ##
    ### set up the appropriate data structures and covariance matrices ### 
    nulldata    = VarianceComponentVariate([Y exposure], X, (2ΦGRM, Matrix{Float64}(I, length(Y), length(Y))))
    ### the computational trick when there are only two variance components ###
    nulldatarot = TwoVarCompVariateRotate(nulldata)
    ## set up the model ##
    nullmodel   = VarianceComponentModel(nulldatarot)
    traitdata_null = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 1], nulldatarot.Xrot,
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    traitmodel_null = VarianceComponentModel(traitdata_null)
    ## Have a choice of using an MM or Scoring Algorithm to do the maximization ##
    if algorithm == :MM
        logl, = mle_mm!(traitmodel_null, traitdata_null; verbose = false)
    elseif algorithm == :FS
        logl, = mle_fs!(traitmodel_null, traitdata_null; verbose = false)
    end

    ### IV as a predictor of the trait  ###
    traitdata = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 1],
      [nulldatarot.Xrot transpose(nulldatarot.eigvec) * IV],
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    traitmodel = VarianceComponentModel(traitdata)
    
    #Set the starting values for the maximum likelihood estimation
    #Use the null model estimates as starting values for the alternative model ##
    fill!(traitmodel.B, zero(T))
    copyto!(traitmodel.B, traitmodel_null.B)
    
    ## get standard error ##
    if algorithm == :MM
        _, _, _, _, traitBseMat, _= mle_mm!(traitmodel, traitdata; verbose = false)
    elseif algorithm == :FS
        _, _, _, _, traitBseMat, _ = mle_fs!(traitmodel, traitdata; verbose = false)
    end
    
    traitB = traitmodel.B[end]
    traitBse = traitBseMat[end]

    ### Repeat above steps using the IV as a predictor of the exposure ##
    expdata = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 2],
      [nulldatarot.Xrot transpose(nulldatarot.eigvec) * IV],
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    expmodel =  VarianceComponentModel(expdata)

    fill!(expmodel.B, zero(T))
    copyto!(expmodel.B, traitmodel_null.B)
    ## extract the standard errors ##
    if algorithm == :MM
      _, _, _, _, expBseMat, _= mle_mm!(expmodel, expdata; verbose = false)
    elseif algorithm == :FS
      _, _, _, _, expBseMat, _= mle_fs!(expmodel, expdata; verbose = false)
    end
    
    expB = expmodel.B[end]
    expBse = expBseMat[end]
    ## estimate the direct effect of the exposure on the trait using the ratio of effects ##
    directbeta = traitB / expB
    ## estimate the standard error of the direct effect ##
    SEbeta = sqrt(traitBse^2/expB^2 + traitB^2 * expBse^2/expB^4)
    W = (directbeta/SEbeta)^2
    
    #need to change the degrees of freedom if running a bivariate outcome
    pvalue = ccdf(Chisq(1), W)
    println("MR direct effects (SE): ", directbeta, "(", SEbeta, "), Pvalue is ", pvalue, "\n")
    println("Exposure effect (SE) is: ", expB, "(", expBse, ")", "\n")
    println("Trait effect (SE) is: ", traitB, "(",traitBse,")", "\n")
    return directbeta, SEbeta
end


MendelianRandomization (generic function with 2 methods)

### Function to determine the Exposure Effect on the Trait using Variance Component Models 

The goal of Mendelian randomization is to assess the statistical support for a causal explanation of an observed association between a trait and an "exposure." In other words, is the observed association due to the exposure having a true effect on the trait? We therefore first need to verify that there is an association between the trait and the exposure. To do so we create another function, ExposureEffectVCM, which calls the variance component models module along with the other modules needed to run this analysis.

We will use this function to determine the simulated Exposure Effect on the trait, adjusting for sex and accounting for family structure. 

In [4]:
function ExposureEffectVCM(Y::AbstractVecOrMat,
    X::AbstractVecOrMat,
    exposure::AbstractVecOrMat,
    ΦGRM::AbstractMatrix,
    algorithm::Symbol = :FS)
    
    T = eltype(X)
    n = length(exposure)

    ## use the function argument Y here, not a global variable ##
    nulldata    = VarianceComponentVariate(Y, X, (2ΦGRM, Matrix{Float64}(I, length(Y), length(Y))))
    nulldatarot = TwoVarCompVariateRotate(nulldata)
    nullmodel   = VarianceComponentModel(nulldatarot)

    ### regress trait ~ covariates only (null model) ###
    traitdata_null = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 1], nulldatarot.Xrot,
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    traitmodel_null = VarianceComponentModel(traitdata_null)

    if algorithm == :MM
        logl, = mle_mm!(traitmodel_null, traitdata_null; verbose = false)
    elseif algorithm == :FS
        logl, = mle_fs!(traitmodel_null, traitdata_null; verbose = false)
    end

    ### regress trait ~ covariates + exposure (linear mixed model) ###
    traitdata = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 1],
      [nulldatarot.Xrot transpose(nulldatarot.eigvec) * exposure],
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    traitmodel = VarianceComponentModel(traitdata)

    #Set the starting values for the maximum likelihood estimation
    #Use the null model estimates as start values for the alternative model.
    fill!(traitmodel.B, zero(T))
    copyto!(traitmodel.B, traitmodel_null.B)
    if algorithm == :MM
        _, _, _, _, traitBseMat, _= mle_mm!(traitmodel, traitdata; verbose = false)
    elseif algorithm == :FS
        _, _, _, _, traitBseMat, _ = mle_fs!(traitmodel, traitdata; verbose = false)
    end
    traitBexp = traitmodel.B[end]
    traitBexpSE = traitBseMat[end]
    return traitBexp, traitBexpSE
end

ExposureEffectVCM (generic function with 2 methods)

## Examples

## Example 1: Fit MR-VCM with major locus, SNP rs11672206

In this example we use the SNP_data28d_trait1.fam data file. The trait was simulated under the scenario where the genetic variant is a strong instrumental variable (IV) and there is a direct effect of the exposure on the trait. We therefore expect the observed, naive association between the exposure and the trait to be close to the direct effect we calculate using MR.

### Data files

To simulate your own data examples you can use the Julia program TraitSimulation.jl or the Trait Simulation option of the Fortran version of Mendel. See [Mendel Option 28 (Trait Simulation) example](https://www.genetics.ucla.edu/software/Mendel_current_doc.pdf#page=279).

Take a look at the pedigree file below. The columns are: :famid, :id, :faid, :moid, :sex, :twin, followed by the simulated exposure (column 7) and the simulated trait (column 8). The ninth column is empty.

In [5]:
using DelimitedFiles
Ped28_1 = readdlm("SNP_data28d_trait1.fam", ','; header = false)

212×9 Array{Any,2}:
     1     16       "          "  …  "          "  38.8938   90.1804  ""
     1   8228       "          "     "          "  47.0274  109.327   ""
     1  17008       "          "     "          "  44.1152  114.15    ""
     1   9218  17008                 "          "  43.0108  109.922   ""
     1   3226   9218                 "          "  42.4626  100.127   ""
     2     29       "          "  …  "          "  40.5792   95.7373  ""
     2   2294       "          "     "          "  40.5459  108.733   ""
     2   3416       "          "     "          "  46.5575  121.181   ""
     2   3916   2294                 "          "  43.2133  101.566   ""
     2   6790   2294                 "          "  45.1688  114.688   ""
     2  14695   2294              …  "          "  39.986    96.3424  ""
     2  17893   2294                 "          "  42.3638   99.7993  ""
     2   6952   3416                 "          "  50.7288  129.112   ""
     ⋮                         

In [6]:
exp_1 = convert(Array{Float64,1}, Ped28_1[:,7])

212-element Array{Float64,1}:
 38.89381
 47.02736
 44.11519
 43.01079
 42.4626 
 40.57919
 40.54588
 46.5575 
 43.21334
 45.16885
 39.98595
 42.3638 
 50.72882
  ⋮      
 44.80302
 38.49253
 38.9816 
 36.28899
 39.90458
 42.37186
 46.63421
 41.71284
 44.51775
 47.25572
 47.14371
 45.97838

We don't need to retain the IDs. The exposure (column 7) was extracted above as exp_1; we now retrieve the trait (column 8) and put it in the array Y_1.

In [7]:
Y_1 = convert(Array{Float64,1}, Ped28_1[:,8])

212-element Array{Float64,1}:
  90.18044
 109.32747
 114.15019
 109.92158
 100.12725
  95.73733
 108.73254
 121.18095
 101.56554
 114.68795
  96.3424 
  99.79928
 129.11242
   ⋮      
 116.24136
  92.61769
  91.62061
  89.25164
 110.5402 
 113.43192
 115.74563
 104.29811
 116.29069
 119.83851
 114.20666
 114.94558

Retrieve sex data coded as 0 (male) or 1 (female) so male is the reference group.

In [8]:
sex_1 = map(x -> (strip(x) == "F") ? 1.0 : 0.0, Ped28_1[:, 5])

212-element Array{Float64,1}:
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 0.0
 1.0
 1.0
 0.0
 ⋮  
 0.0
 1.0
 1.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

Take a look at the first 10 lines of the SNP definition file. Notice the first two lines are included by default for formatting, but should be excluded from the analysis. A semicolon ";" at the start of a cell runs the rest of the line as a Unix shell command. The shell commands that follow assume a macOS or Linux environment; the Julia commands should run regardless of OS.

In [9]:
;head SNP_data28d_trait1.bim

19	rs3020701       	0	90974	1	2
19	rs56343121      	0	91106	1	2
19	rs143501051     	0	93542	1	2
19	rs56182540      	0	95981	1	2
19	rs7260412       	0	105021	1	2
19	rs11669393      	0	107866	1	2
19	rs181646587     	0	107894	1	2
19	rs8106297       	0	107958	1	2
19	rs8106302       	0	107962	1	2
19	rs183568620     	0	107987	1	2


In [10]:
snpdef28_1 = readdlm("SNP_data28d_trait1.bim", '\t'; header = false)

253141×6 Array{Any,2}:
 19  "rs3020701       "  0     90974  1  2
 19  "rs56343121      "  0     91106  1  2
 19  "rs143501051     "  0     93542  1  2
 19  "rs56182540      "  0     95981  1  2
 19  "rs7260412       "  0    105021  1  2
 19  "rs11669393      "  0    107866  1  2
 19  "rs181646587     "  0    107894  1  2
 19  "rs8106297       "  0    107958  1  2
 19  "rs8106302       "  0    107962  1  2
 19  "rs183568620     "  0    107987  1  2
 19  "rs186451972     "  0    108003  1  2
 19  "rs189699222     "  0    108032  1  2
 19  "rs182902214     "  0    108090  1  2
  ⋮                                      ⋮
 19  "rs188169422     "  0  59116080  1  2
 19  "rs144587467     "  0  59117729  1  2
 19  "rs139879509     "  0  59117949  1  2
 19  "rs143250448     "  0  59117982  1  2
 19  "rs145384750     "  0  59118028  1  2
 19  "rs149215836     "  0  59118040  1  2
 19  "rs139221927     "  0  59118044  1  2
 19  "rs181848453     "  0  59118114  1  2
 19  "rs138318162     "  0  591

In this example we will analyze a single SNP, rs11672206, so we don't need the positions of the SNPs, just the SNP IDs to locate the SNP data. Thus we put the SNP IDs in a vector.

In [11]:
snpid = map(x -> strip(string(x)), snpdef28_1[:, 2])

253141-element Array{SubString{String},1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

### Read in the SNP binary file using the SnpArrays.jl package.

Because the SnpArray function requires an input file name ending in .bed rather than .bin, we create a symbolic link SNP_data28d_trait1.bed to SNP_data28d_trait1.bin. (If you have trouble getting this command to work on your computer, you can copy the file outside of Julia.) Note that SnpArrays requires the set of PLINK files to share the same base name. In this first example we use the files SNP_data28d_trait1.bed, SNP_data28d_trait1.fam, and SNP_data28d_trait1.bim.

In [12]:
;ln -s ./SNP_data28d_trait1.bin ./SNP_data28d_trait1.bed

ln: ./SNP_data28d_trait1.bed: File exists
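
The `File exists` message above simply means the link was already created in an earlier run. As an alternative that avoids the shell, the same step can be sketched in Julia. The helper name `ensure_link` is hypothetical, and the sketch assumes a file system that supports symbolic links.

```julia
# Create `link` pointing at `target` unless something already exists at `link`.
function ensure_link(target::AbstractString, link::AbstractString)
    if !(isfile(link) || islink(link))
        symlink(target, link)
    end
    return link
end

# Demonstration in a temporary directory with a made-up file name.
dir = mktempdir()
src = joinpath(dir, "toy.bin")
dst = joinpath(dir, "toy.bed")
write(src, "x")
ensure_link(src, dst)
ensure_link(src, dst)   # a second call does nothing and raises no error
println(islink(dst))
```

In the notebook the call would be `ensure_link("SNP_data28d_trait1.bin", "SNP_data28d_trait1.bed")`, run from the directory that holds the data files.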


Read in the binary .bed file of genetic variants using the 'SnpArrays' package.

In [14]:
using SnpArrays
snpbin28_1 = SnpArray("SNP_data28d_trait1.bed")

212×253141 SnpArray:
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x02  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x02  0x02  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x03  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03  …  0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x03  0x03  0x00  0x03  0x00  0x03
    

### Kinship via Genetic Relationship Matrix (GRM)

Recall that in using variance components to account for the relatedness among individuals, we need some measure of that relatedness. We could use the pedigree structure to calculate the theoretical kinships, but that is an idealized measurement of relatedness that assumes the pedigrees are known exactly and that founders are completely unrelated. We don't want to make these assumptions, so instead we use the SNPs to estimate the kinships.
Under the GRM formulation, the estimate of the global kinship coefficient of individuals $i$ and $j$ is
$$ \widehat\Phi_{GRM,ij} = \frac{1}{2S} \sum_{k=1}^S \frac{(x_{ik} - 2p_k)(x_{jk} - 2p_k)}{2 p_k (1-p_k)}, $$
where $k$ ranges over the selected $S$ SNPs, $p_k$ is the minor allele frequency of SNP $k$, and $x_{ik}$ is the number of minor alleles in individual $i$'s genotype at SNP $k$.
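
To make the formula concrete, here is a minimal sketch that evaluates it on a tiny made-up genotype matrix. It is not the `grm` function from SnpArrays.jl: the name `grm_sketch` is hypothetical, allele frequencies are estimated from the sample itself, missing genotypes are not handled, and monomorphic SNPs are simply skipped.

```julia
# x is an n × S matrix of minor-allele counts (0, 1 or 2), with no missing values.
function grm_sketch(x::AbstractMatrix{<:Real})
    n, S = size(x)
    Φ = zeros(n, n)
    used = 0
    for k in 1:S
        p = sum(x[:, k]) / (2n)             # estimated allele frequency of SNP k
        (p <= 0 || p >= 1) && continue      # skip monomorphic SNPs
        z = (x[:, k] .- 2p) ./ sqrt(2p * (1 - p))
        Φ .+= z * z'                        # adds (x_ik - 2p)(x_jk - 2p) / (2p(1 - p))
        used += 1
    end
    return Φ ./ (2 * used)                  # the 1/(2S) factor over the SNPs used
end

# Toy data: 4 individuals, 3 SNPs, each SNP with allele frequency 0.5.
x = [0 1 2; 1 1 0; 2 0 1; 1 2 1]
Φtoy = grm_sketch(x)
println(Φtoy[1, 1])
```

For the first individual the centered genotypes are (-1, 0, 1), so the diagonal entry is (1 + 0 + 1) / 0.5 / 6 = 2/3.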

## Calculate the GRM matrix

Use the grm function in the SnpArrays.jl package to create the genetic relationship matrix from the genetic variants in the .bed file for example 1. By default, `grm` excludes SNPs with maf < 0.01.

In [15]:
ΦGRM1 = grm(snpbin28_1)

212×212 Array{Float64,2}:
  0.488859      0.00422798    0.00978752   …   0.0187809     0.000152741
  0.00422798    0.515141     -0.0190676       -0.0220944    -0.0162381  
  0.00978752   -0.0190676     0.489156        -0.0143552    -0.00392193 
  0.241927     -0.00231651    0.266756         0.00264766   -0.00334162 
  0.124909      0.265109      0.12103         -0.0118053    -0.00641433 
 -0.0121766    -0.000271784  -0.00316506   …  -0.000629837   0.0103776  
 -0.011988      0.00845464   -0.00950518      -0.0105435     0.000340285
 -0.0134368    -0.00713175   -0.0101011       -0.0115942    -0.00288935 
 -0.0234305    -0.00241126   -0.010413         0.000129339   0.0166282  
 -0.0121471     0.00305141   -0.00754556      -0.0104162     0.00554306 
 -0.0128464     0.010258      0.00110475   …  -0.00424998    0.0124296  
 -0.0111854     0.00423488   -0.0117065       -0.0244977    -0.00698659 
 -0.0124607    -0.00575043   -0.0124951       -0.00573017   -0.0103429  
  ⋮                      

### Association of the Trait and Exposure:

In [16]:
###### First Example Exposure Effect on Trait using VCM ####
X = [ones(length(Y_1), 1) sex_1]
exposure = exp_1
outcome = Y_1

212-element Array{Float64,1}:
  90.18044
 109.32747
 114.15019
 109.92158
 100.12725
  95.73733
 108.73254
 121.18095
 101.56554
 114.68795
  96.3424 
  99.79928
 129.11242
   ⋮      
 116.24136
  92.61769
  91.62061
  89.25164
 110.5402 
 113.43192
 115.74563
 104.29811
 116.29069
 119.83851
 114.20666
 114.94558

In [17]:
#Calculate the exposure effect
ExposureEffectVCM(outcome, X, exposure, ΦGRM1, :FS)


******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************



(1.9471633423548604, 0.059848806492927765)

## Use MR to Check for Evidence of a Causal Relationship of Exposure on the Trait. 

### Prepare Data: Find the data for SNP rs11672206 and convert from binary to the number of minor alleles. 

In [18]:
ind_rs11672206 = findall(x -> x == "rs11672206", snpid)[1]

236079
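
Note that `findall(...)[1]` throws a `BoundsError` if the SNP ID is misspelled or absent. A slightly more defensive lookup can be sketched with `findfirst`; the helper name `snp_index` and the toy ID vector are ours, used only for illustration.

```julia
# Return the position of `name` in `snpids`, with a clear error if it is absent.
function snp_index(snpids::AbstractVector{<:AbstractString}, name::AbstractString)
    idx = findfirst(isequal(name), snpids)
    idx === nothing && error("SNP $name not found")
    return idx
end

toy_ids = ["rs3020701", "rs56343121", "rs143501051"]
println(snp_index(toy_ids, "rs56343121"))
```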

In [19]:
snpbin28_1[:, ind_rs11672206]

212-element Array{UInt8,1}:
 0x00
 0x03
 0x00
 0x00
 0x02
 0x02
 0x00
 0x00
 0x02
 0x02
 0x02
 0x02
 0x02
    ⋮
 0x02
 0x02
 0x02
 0x02
 0x00
 0x02
 0x02
 0x02
 0x00
 0x02
 0x00
 0x02

In [20]:
snp_rs11672206 = convert(Array{Float64,1}, @view(snpbin28_1[:, ind_rs11672206]))

212-element Array{Float64,1}:
 0.0
 2.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮  
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0
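
The conversion above maps the 2-bit codes stored in the .bed file to allele counts. A minimal sketch of that mapping follows, assuming the usual PLINK coding in which `0x01` marks a missing genotype; the function name `dosage` is hypothetical and is not part of SnpArrays.jl.

```julia
# Map a 2-bit genotype code to an allele count (NaN for a missing genotype).
function dosage(code::UInt8)
    code == 0x00 && return 0.0
    code == 0x02 && return 1.0
    code == 0x03 && return 2.0
    return NaN   # 0x01 is assumed to mark a missing genotype
end

codes = UInt8[0x00, 0x03, 0x00, 0x00, 0x02]   # first five codes shown above
dosages = dosage.(codes)
println(dosages)
```

The result agrees with the first five entries of `snp_rs11672206`.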

### Run MR with variance components

In [21]:
###### First Example MR####
MendelianRandomization(Y_1, [ones(length(Y_1), 1) sex_1], snp_rs11672206, exp_1, ΦGRM1)

MR direct effects (SE): 1.9758504087435436(0.40446735551800067), Pvalue is 1.0339332851609442e-6

Exposure effect (SE) is: 2.5214901400746736(0.3269495431167816)

Trait effect (SE) is: 4.982087323909359(0.7891735914492959)



(1.9758504087435436, 0.40446735551800067)

The MR direct effect is calculated as the ratio of the effect of the SNP on the trait (trait effect) to the effect of the same SNP on the exposure (exposure effect): $\beta_{DE} = \beta_{TE} / \beta_{EE}$. Note that in this case our direct effect estimate (1.98) is very similar to the naive estimate (1.95) in which we used the exposure as a predictor, although its standard error is considerably larger (0.40 versus 0.06).
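
As a check on the arithmetic, the ratio and its standard error can be recomputed by hand from the trait and exposure effects printed above. The sketch uses only those printed numbers; the function name `ratio_estimate` is ours, and the standard error expression is copied from the `MendelianRandomization` function.

```julia
# Ratio estimate and its approximate standard error,
# using the same formula as in MendelianRandomization.
function ratio_estimate(traitB, traitBse, expB, expBse)
    directbeta = traitB / expB
    SEbeta = sqrt(traitBse^2 / expB^2 + traitB^2 * expBse^2 / expB^4)
    return directbeta, SEbeta
end

# Values printed for Example 1
directbeta, SEbeta = ratio_estimate(4.982087323909359, 0.7891735914492959,
                                    2.5214901400746736, 0.3269495431167816)
wald = (directbeta / SEbeta)^2   # referred to a chi-square distribution with 1 df
println((directbeta, SEbeta, wald))
```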

## Example 2: No Causal Relationship - Association due to Confounding

In this example we use the files SNP_data28exptrait2.bed, SNP_data28exptrait2.fam, and SNP_data28exptrait2.bim, simulated under the scenario where there is still a strong IV but any association between the trait and the environmental predictor is due to confounding.

## Loading and Preparing the Data

Read in the pedigree file and take a look at it. This file is in the classic Mendel format, with columns :famid, :id, :faid, :moid, :sex (F for female, M for male), :twin (monozygotic twin indicator), :simtrait1, and :simtrait2.

In [22]:
# columns are: :famid, :id, :faid, :moid, :sex, :twin, :simtrait1, :simtrait2
ped28_2 = readdlm("SNP_data28exptrait2.fam", ','; header = false)

212×9 Array{Any,2}:
     1     16       "          "  …  "          "  32.1022  14.7783  ""
     1   8228       "          "     "          "  32.0522  16.6838  ""
     1  17008       "          "     "          "  45.3005  21.6456  ""
     1   9218  17008                 "          "  43.4727  24.2119  ""
     1   3226   9218                 "          "  37.7578  20.511   ""
     2     29       "          "  …  "          "  36.9209  17.0922  ""
     2   2294       "          "     "          "  43.3864  20.414   ""
     2   3416       "          "     "          "  45.7102  23.3037  ""
     2  17893   2294                 "          "  33.4674  15.3641  ""
     2   6952   3416                 "          "  39.5669  21.3321  ""
     2  14695   2294              …  "          "  30.7948  16.3268  ""
     2   6790   2294                 "          "  38.5657  22.479   ""
     2   3916   2294                 "          "  34.3997  17.2739  ""
     ⋮                            ⋱  ⋮      

Again, we don't need to retain the ids so we retrieve the phenotype data and put them in Y_2 and exp_2, to be used as the respective outcome and exposure in example 2.

In [23]:
exp_2 = convert(Vector{Float64}, ped28_2[:, 7])
Y_2 = convert(Vector{Float64}, ped28_2[:, 8])
Exposure_Trait2 = [exp_2 Y_2]

212×2 Array{Float64,2}:
 32.1022  14.7783
 32.0522  16.6838
 45.3005  21.6456
 43.4727  24.2119
 37.7578  20.511 
 36.9209  17.0922
 43.3864  20.414 
 45.7102  23.3037
 33.4674  15.3641
 39.5669  21.3321
 30.7948  16.3268
 38.5657  22.479 
 34.3997  17.2739
  ⋮              
 43.8188  20.6551
 38.916   17.9165
 40.7922  23.3292
 34.847   13.2469
 41.9567  23.4621
 39.7982  20.5745
 39.3295  20.8389
 38.8928  16.6698
 46.2767  26.1484
 42.8913  23.4916
 44.5806  23.3856
 38.9998  22.224 

Retrieve sex data coded as 0 (male) or 1 (female) so male is the reference group.

In [24]:
sex_2 = map(x -> (strip(x) == "F") ? 1.0 : 0.0, ped28_2[:, 5])

212-element Array{Float64,1}:
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 0.0
 1.0
 0.0
 1.0
 ⋮  
 0.0
 0.0
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

Read in the SNP definition file.

In [25]:
# columns (PLINK .bim order) are: :chrom, :snpid, :geneticdistance, :pos, :allele1, :allele2
snpdef28_2 = readdlm("SNP_data28exptrait2.bim", '\t'; header = false)

253141×6 Array{Any,2}:
 19  " rs3020701       "  0     90974  1  2
 19  " rs56343121      "  0     91106  1  2
 19  " rs143501051     "  0     93542  1  2
 19  " rs56182540      "  0     95981  1  2
 19  " rs7260412       "  0    105021  1  2
 19  " rs11669393      "  0    107866  1  2
 19  " rs181646587     "  0    107894  1  2
 19  " rs8106297       "  0    107958  1  2
 19  " rs8106302       "  0    107962  1  2
 19  " rs183568620     "  0    107987  1  2
 19  " rs186451972     "  0    108003  1  2
 19  " rs189699222     "  0    108032  1  2
 19  " rs182902214     "  0    108090  1  2
  ⋮                                       ⋮
 19  " rs188169422     "  0  59116080  1  2
 19  " rs144587467     "  0  59117729  1  2
 19  " rs139879509     "  0  59117949  1  2
 19  " rs143250448     "  0  59117982  1  2
 19  " rs145384750     "  0  59118028  1  2
 19  " rs149215836     "  0  59118040  1  2
 19  " rs139221927     "  0  59118044  1  2
 19  " rs181848453     "  0  59118114  1  2
 19  " rs

We will be analyzing SNPs one at a time, so we don't need the positions of the SNPs, just the SNP IDs. We therefore retrieve the SNP IDs.

In [26]:
snpid_2 = map(x -> strip(string(x)), snpdef28_2[:, 2])

253141-element Array{SubString{String},1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

As in Example 1, create a symbolic link so that the SNP binary file has the .bed extension that the SnpArrays.jl package expects.

In [27]:
;ln -s ./SNP_data28exptrait2.bin ./SNP_data28exptrait2.bed

ln: ./SNP_data28exptrait2.bed: File exists


Read in the binary .bed file of genetic variants using the 'SnpArrays' package.

In [28]:
snpbin28_2 = SnpArray("SNP_data28exptrait2.bed")

212×253141 SnpArray:
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x02  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x02  0x02  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x03  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x03  0x03  0x00  0x03  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03  …  0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
    

Use the grm function in the SnpArrays.jl package to create the genetic relationship matrix from the genetic variants in the .bed file for example 2. By default, `grm` excludes SNPs with maf < 0.01.

In [29]:
ΦGRM2 = grm(snpbin28_2)

212×212 Array{Float64,2}:
  0.488859      0.00422798    0.00978752   …   0.0187809     0.000152741
  0.00422798    0.515141     -0.0190676       -0.0220944    -0.0162381  
  0.00978752   -0.0190676     0.489156        -0.0143552    -0.00392193 
  0.241927     -0.00231651    0.266756         0.00264766   -0.00334162 
  0.124909      0.265109      0.12103         -0.0118053    -0.00641433 
 -0.0121766    -0.000271784  -0.00316506   …  -0.000629837   0.0103776  
 -0.011988      0.00845464   -0.00950518      -0.0105435     0.000340285
 -0.0134368    -0.00713175   -0.0101011       -0.0115942    -0.00288935 
 -0.0111854     0.00423488   -0.0117065       -0.0244977    -0.00698659 
 -0.0124607    -0.00575043   -0.0124951       -0.00573017   -0.0103429  
 -0.0128464     0.010258      0.00110475   …  -0.00424998    0.0124296  
 -0.0121471     0.00305141   -0.00754556      -0.0104162     0.00554306 
 -0.0234305    -0.00241126   -0.010413         0.000129339   0.0166282  
  ⋮                      

## Check for Association of the Trait with the Exposure:

In [30]:
###### Second Example Exposure Effect on Trait using VCM ####
X = [ones(length(Y_2), 1) sex_2]
exposure = exp_2
outcome = Y_2
ExposureEffectVCM(outcome, X, exposure, ΦGRM2)

(0.49947517803154445, 0.05399663242445667)

In [31]:
ind_rs11672206_2 = findall(x -> x == "rs11672206", snpid_2)[1]

236079

In [32]:
snp_rs11672206_2 = convert(Array{Float64,1}, @view(snpbin28_2[:, ind_rs11672206_2]))

212-element Array{Float64,1}:
 0.0
 2.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮  
 0.0
 1.0
 0.0
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0

## Example 2 Mendelian Randomization VCM:

In [33]:
#Example 2
MendelianRandomization(Y_2, [ones(length(Y_2), 1) sex_2], snp_rs11672206_2, exp_2, ΦGRM2)

MR direct effects (SE): 0.2437481344559426(0.15233855361360252), Pvalue is 0.10958919289247747

Exposure effect (SE) is: -1.8980279594308787(0.25670960108586316)

Trait effect (SE) is: -0.4626407742564962(0.2822910952147581)



(0.2437481344559426, 0.15233855361360252)

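In this example, the MR direct effect estimate (0.24, SE 0.15) and the naive estimate from regressing the trait on the exposure (0.50, SE 0.05) are quite different. The MR p-value is about 0.11, so we fail to reject the null hypothesis of no causal effect of the exposure on the trait. This is consistent with the observed association being due to confounding.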

## Example 3: The Problem of a Weak IV

In this example we use the files SNP_data28d_trait3.bed, SNP_data28d_trait3.fam, and SNP_data28d_trait3.bim, simulated under the scenario where the genetic marker is a weak IV, to demonstrate what happens when one of the essential assumptions of MR (a strong association between the IV and the exposure) is violated. This example uses the same genotype data and pedigree structure as Example 1, so we can reuse the GRM from Example 1.

### Loading and preparing the data

In [34]:
ped28_3 = readdlm("SNP_data28d_trait3.fam", ','; header = false)

212×9 Array{Any,2}:
     1     16       "          "  …  "          "  40.1878  214.434  ""
     1   8228       "          "     "          "  47.0623  250.469  ""
     1  17008       "          "     "          "  44.8772  249.917  ""
     1   9218  17008                 "          "  41.4107  231.419  ""
     1   3226   9218                 "          "  40.3835  217.034  ""
     2     29       "          "  …  "          "  39.6165  212.84   ""
     2   2294       "          "     "          "  38.8691  220.87   ""
     2   3416       "          "     "          "  48.9882  271.711  ""
     2   3916   2294                 "          "  45.129   240.725  ""
     2   6790   2294                 "          "  43.2185  240.717  ""
     2  14695   2294              …  "          "  38.846   210.021  ""
     2  17893   2294                 "          "  43.9826  234.954  ""
     2   6952   3416                 "          "  50.7849  280.457  ""
     ⋮                            ⋱  ⋮      

In [35]:
exp_3 = convert(Array{Float64,1}, ped28_3[:, 7])

212-element Array{Float64,1}:
 40.18776
 47.06231
 44.87715
 41.41074
 40.38347
 39.61654
 38.8691 
 48.98817
 45.12903
 43.21847
 38.84599
 43.98256
 50.78492
  ⋮      
 41.12748
 38.56373
 35.6761 
 32.72347
 37.78963
 38.53456
 45.70917
 37.42526
 45.55476
 46.75532
 49.9749 
 44.60524

In [36]:
Y_3 = convert(Array{Float64,1}, ped28_3[:, 8])

212-element Array{Float64,1}:
 214.43354
 250.46902
 249.9168 
 231.41861
 217.03401
 212.8396 
 220.87015
 271.71098
 240.72532
 240.71722
 210.02121
 234.95418
 280.45734
   ⋮      
 231.58155
 208.1839 
 192.60535
 179.58364
 217.25697
 219.80218
 252.08932
 209.74325
 254.07583
 258.96543
 271.94112
 246.86504

In [37]:
sex_3 = map(x -> (strip(x) == "F") ? 1.0 : 0.0, ped28_3[:, 5])

212-element Array{Float64,1}:
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 0.0
 1.0
 1.0
 0.0
 ⋮  
 0.0
 1.0
 1.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

In [38]:
snpid = map(x -> strip(string(x)), snpdef28_1[:, 2])

253141-element Array{SubString{String},1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

## Exposure Effect VCM:

In this example the exposure has a causal effect on the trait, so the association between the trait and the exposure is still strong. However, the associations between the SNP and the exposure and between the SNP and the trait are both weak.

In [39]:
###### Third Example Exposure Effect on Trait using VCM ####
X = [ones(length(Y_3), 1) sex_3]
exposure = exp_3
outcome = Y_3
ExposureEffectVCM(outcome, X, exposure, ΦGRM1)

(4.976393615289701, 0.02101062009606309)

In [40]:
ind_rs11672206 = findall(x -> x == "rs11672206", snpid)[1]

236079

In [41]:
snp_rs11672206_3 = convert(Array{Float64,1}, @view(snpbin28_1[:, ind_rs11672206]))

212-element Array{Float64,1}:
 0.0
 2.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮  
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0

## Example 3 Mendelian Randomization VCM:

In [42]:
MendelianRandomization(Y_3,[ones(length(Y_3),1) sex_3], snp_rs11672206_3, exp_3, ΦGRM1)

MR direct effects (SE): 5.653306721587125(6.183010137523407), Pvalue is 0.3605438909407806

Exposure effect (SE) is: 0.5704605759599244(0.4670817955931797)

Trait effect (SE) is: 3.2249886084747033(2.3384488620967234)



(5.653306721587125, 6.183010137523407)

In this case, we simulated a causal effect of the exposure on the trait such that a one-unit increase in the exposure raises the trait value by 5 units. Because the IV is weak, the MR estimate of the direct effect is both imprecise and inaccurate: 5.65 with an SE of 6.18, giving a p-value of 0.36. As is the case here, the estimate is often an overestimate, and the very large SE leads to poor power to detect the effect.
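One common way to gauge instrument strength is the squared Wald statistic of the SNP-exposure estimate, which behaves approximately like a first-stage F statistic when there is a single instrument. The sketch below applies it to the exposure effects printed in Examples 2 and 3. `instrument_strength` is a hypothetical helper, and the cutoff of 10 is a widely used rule of thumb rather than anything computed by this notebook.

```julia
# Hypothetical helper (not part of the notebook): squared Wald statistic of the
# SNP-exposure estimate, used here as an approximate first-stage F statistic.
instrument_strength(β_exp, se_exp) = (β_exp / se_exp)^2

# SNP-exposure estimates and SEs copied from the printed MR output
F_example2 = instrument_strength(-1.8980279594308787, 0.25670960108586316)
F_example3 = instrument_strength(0.5704605759599244, 0.4670817955931797)

# Rule-of-thumb cutoff; an assumption, not a value taken from the notebook
F_threshold = 10.0

for (name, F) in [("Example 2", F_example2), ("Example 3", F_example3)]
    label = F > F_threshold ? "strong" : "weak"
    println(name, ": F ≈ ", round(F, digits = 2),
            " (", label, " by the F > 10 rule of thumb)")
end
```

Under these assumptions the Example 2 statistic is roughly 55 and the Example 3 statistic is roughly 1.5, which is consistent with the wide SE of the Example 3 MR estimate.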

# Conclusions

This exercise demonstrates:
(1) That, when the assumptions of MR hold, we can learn about the causal relationship between an exposure and a trait using family data. Note that we have only scratched the surface of the methods that can be used to infer causality. Within MR there are a number of approaches that have been optimized for different types of data. More generally, causal inference is a huge field that has been important in epidemiology and economics in particular, and we encourage you to learn more about these approaches.