# Mendelian Randomization (MR) Using Variance Component Models (VCM)

In this notebook, we demonstrate in three examples how to conduct Mendelian randomization (MR) using variance component models as a method of causal inference. We do not cover all the assumptions, caveats, extensions, advantages, or disadvantages of Mendelian randomization; for these we refer the reader to the book "Mendelian Randomization: Methods for Using Genetic Variants in Causal Estimation" by Stephen Burgess and Simon G. Thompson. Our approach differs from most MR examples in that we use family data, so we must also account for the correlations among relatives in our analyses; thus we use variance component analysis (also known as linear mixed modeling). We simulate the trait and the exposure, and have genotypes, all on the same family members. In principle, the exposure-IV relationship and the trait-IV relationship can be examined in different samples, but this requires additional assumptions.

We simulated the data outside of this notebook, starting from the Ped-GWAS data of example 28d of Mendel v16.0, which is available as part of the Fortran Mendel package (http://www.genetics.ucla.edu/software/).

The notebook is organized as follows:

Example 1: In this example we use MR-VCM with SNP rs11672206 simulated as a strong instrumental variable (IV) and simulated phenotype data. This first example provides the user with an example in which the genetic variant is a strong IV and there is a direct effect of the exposure on the trait. This example shows one scenario in which MR can be quite useful. 

Example 2: In this example we use MR-VCM with SNP rs11672206 again simulated as a strong IV and simulated phenotype data. In this second example there is still a strong IV but any association between the trait and the environmental predictor is due to confounding. This example shows another scenario in which MR can be quite useful. 

Example 3: In this final example we use MR-VCM with SNP rs7255584 as the IV. The actual IV is still rs11672206, but this time we simulate it to be a relatively weak IV. Furthermore, we do not use rs11672206 itself; we use a proxy IV, rs192667878, that is in LD with the true IV. In this third example, we demonstrate that there can be considerable bias if the marker is a weak IV. This is one of the problems that can occur in practice with MR.

## Two-Stage Regression Framework for Mendelian Randomization

First we introduce the module MendelianRandomizationVCM, which defines the MendelianRandomization function and loads the VarianceComponentModels package along with the other packages needed to fit variance component models.

In [1]:
module MendelianRandomizationVCM

export MendelianRandomization
using DataFrames
using StatsBase
using VarianceComponentModels
using GLM
using SnpArrays
using Distributions

#function MR(IV,Expose,Trait)
function MendelianRandomization(Y::AbstractVecOrMat,
    X::AbstractVecOrMat,
    #IV::SnpArrays.SnpArray{2},
    IV::AbstractVecOrMat,
    exposure::AbstractVecOrMat,
    GRM::AbstractMatrix,
    algorithm::Symbol = :FS)

    T = eltype(X)
    n = length(IV)

    ## fit null model without IV effects ##
    nulldata    = VarianceComponentVariate([Y exposure], X, (2GRM, eye(n)))
    nulldatarot = TwoVarCompVariateRotate(nulldata)
    nullmodel   = VarianceComponentModel(nulldatarot)

    traitdata_null = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 1], nulldatarot.Xrot,
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    traitmodel_null = VarianceComponentModel(traitdata_null)

    if algorithm == :MM
        logl, = mle_mm!(traitmodel_null, traitdata_null; verbose = false)
    elseif algorithm == :FS
        logl, = mle_fs!(traitmodel_null, traitdata_null; verbose = false)
    end

    ### regress IV with trait
    traitdata = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 1],
      [nulldatarot.Xrot At_mul_B(nulldatarot.eigvec, IV)],
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    traitmodel = VarianceComponentModel(traitdata)
    #Set the starting values for the maximum likelihood estimation
    #Use the null model estimates as start values for the alternative model.
    fill!(traitmodel.B, zero(T))
    copy!(traitmodel.B, traitmodel_null.B)
    if algorithm == :MM
        _, _, _, _, traitBseMat, _= mle_mm!(traitmodel, traitdata; verbose = false)
    elseif algorithm == :FS
        _, _, _, _, traitBseMat, _ = mle_fs!(traitmodel, traitdata; verbose = false)
    end
    traitB = traitmodel.B[end]
    traitBse = traitBseMat[end]

    ### regress IV with exposure
    expdata = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 2],
      [nulldatarot.Xrot At_mul_B(nulldatarot.eigvec, IV)],
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    expmodel =  VarianceComponentModel(expdata)

    fill!(expmodel.B, zero(T))
    copy!(expmodel.B, traitmodel_null.B)
    if algorithm == :MM
      _, _, _, _, expBseMat, _= mle_mm!(expmodel, expdata; verbose = false)
    elseif algorithm == :FS
      _, _, _, _, expBseMat, _= mle_fs!(expmodel, expdata; verbose = false)
    end
    expB = expmodel.B[end]
    expBse = expBseMat[end]

    directbeta = traitB / expB
    SEbeta = sqrt(traitBse^2/expB^2+traitB^2*expBse^2/expB^4)
    W = (directbeta/SEbeta)^2
    #change the degrees of freedom if running a bivariate outcome
    pvalue = ccdf(Chisq(1), W)
    println("MR direct effects (SE): ", directbeta, "(", SEbeta, "), Pvalue is ", pvalue,"\n")
    println("Exposure effect (SE) is: ",expB,"(", expBse, ")","\n")
    println("Trait effect (SE) is: ", traitB,"(",traitBse,")","\n")
    return directbeta, SEbeta
end
end # module

INFO: Precompiling module MathProgBase.
INFO: Precompiling module Ipopt.

Use "VarianceComponentModel{T,M,BT,ΣT}(...) where {T,M,BT,ΣT}" instead.

Use "TwoVarCompModelRotate{T,BT}(...) where {T,BT}" instead.

Use "VarianceComponentVariate{T,M,YT,XT,VT}(...) where {T,M,YT,XT,VT}" instead.

Use "TwoVarCompVariateRotate{T,YT,XT}(...) where {T,YT,XT}" instead.
INFO: Precompiling module GLM.

MendelianRandomizationVCM

In [2]:
using MendelianRandomizationVCM
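
The last lines of `MendelianRandomization` combine the two IV regressions into a ratio estimate. In the notation below (which is ours, not the notebook's), $\hat\beta_{ZY}$ and $\hat\beta_{ZX}$ stand for `traitB` and `expB`, the estimated effects of the IV on the trait and on the exposure, with standard errors `traitBse` and `expBse`. The standard error is a first-order delta-method approximation that, as coded, omits any covariance between the two estimates:

$$\hat\beta_{MR} = \frac{\hat\beta_{ZY}}{\hat\beta_{ZX}}, \qquad SE(\hat\beta_{MR}) \approx \sqrt{\frac{SE(\hat\beta_{ZY})^2}{\hat\beta_{ZX}^2} + \frac{\hat\beta_{ZY}^2 \, SE(\hat\beta_{ZX})^2}{\hat\beta_{ZX}^4}}, \qquad W = \left(\frac{\hat\beta_{MR}}{SE(\hat\beta_{MR})}\right)^2$$

The reported p-value is the upper tail probability of $W$ under a chi-square distribution with 1 degree of freedom.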

# Exposure Effect on Trait using Variance Component Models 

Next we define the function ExposureEffectVCM, which fits a variance component model using the same packages as above.

Because we have simulated the trait, exposure, and genotypes on the same family members, we can also estimate the beta coefficient for the regression of trait on exposure, adjusting for the correlation among family members' trait values.

In [3]:
using DataFrames
using StatsBase
using VarianceComponentModels
using GLM
using SnpArrays
using Distributions

function ExposureEffectVCM(Y::AbstractVecOrMat,
    X::AbstractVecOrMat,
    exposure::AbstractVecOrMat,
    GRM::AbstractMatrix)

    T = eltype(X)
    n = length(exposure)

    ## fit null model without the exposure effect ##
    # Use the function arguments Y and GRM rather than global variables.
    nulldata    = VarianceComponentVariate(Y, X, (GRM, eye(n)))
    nulldatarot = TwoVarCompVariateRotate(nulldata)
    nullmodel   = VarianceComponentModel(nulldatarot)

    ### regress trait ~ covariates (null model)
    traitdata_null = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 1], nulldatarot.Xrot,
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    traitmodel_null = VarianceComponentModel(traitdata_null)
    logl, = mle_fs!(traitmodel_null, traitdata_null; verbose = false)

    ### regress trait ~ covariates + exposure (linear mixed model)
    traitdata = TwoVarCompVariateRotate(nulldatarot.Yrot[:, 1],
      [nulldatarot.Xrot At_mul_B(nulldatarot.eigvec, exposure)],
      nulldatarot.eigval, nulldatarot.eigvec, nulldatarot.logdetV2)
    traitmodel = VarianceComponentModel(traitdata)

    # Use the null model estimates as start values for the alternative model.
    fill!(traitmodel.B, zero(T))
    copy!(traitmodel.B, traitmodel_null.B)
    _, _, _, _, traitBseMat, _ = mle_fs!(traitmodel, traitdata; verbose = false)

    traitBexp = traitmodel.B[end]
    traitBexpSE = traitBseMat[end]
    return traitBexp, traitBexpSE
end
# Wald test of the exposure effect: pval = ccdf(Chisq(1), (traitBexp / traitBexpSE)^2)

ExposureEffectVCM (generic function with 1 method)

## Example 1: Fit MR-VCM with major locus, SNP rs11672206

In this example we use the Ped28trait1.out data file, simulated under the scenario where the genetic variant is a strong instrumental variable (IV) and there is a direct effect of the exposure on the trait.

### Data files

We start from the following three files from the [Mendel Option 28 (Trait Simulation) example](https://www.genetics.ucla.edu/software/Mendel_current_doc.pdf#page=279): the pedigree file `Ped28trait1.out`, the SNP definition file `SNP_def28trait1.out`, and the binary SNP data file `SNP_data28d_trait1.bin`. The shell commands below assume a macOS or Linux environment. The Julia commands should run regardless of OS.

Take a look at the pedigree file below. The columns are :famid, :id, :faid, :moid, :sex, :twin, :exp1, :simtrait1, but the file itself contains no header line. If your data file has a header line, change the command to
`Data = readcsv("filename.in", Any; header = true)`
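
The cells in this notebook use `readcsv`, which was removed from Base in Julia 1.0. As a minimal sketch for readers on Julia 1.x, the same parsing steps can be done with `readdlm` from the DelimitedFiles standard library. The two pedigree rows below are made up for illustration and only mimic the column layout of the file; the variable names are hypothetical.

```julia
using DelimitedFiles

# Two made-up pedigree rows in the column layout described above:
# famid, id, faid, moid, sex, twin, exposure, trait
text = """
1,16,,,F,,38.9,90.2
1,9218,17008,8228,M,,43.0,109.9
"""

# Parse comma-separated text into an Array{Any,2}; empty fields become ""
ped = readdlm(IOBuffer(text), ',', Any)

exposure = Float64.(ped[:, 7])    # column 7: exposure
trait    = Float64.(ped[:, 8])    # column 8: trait
# Code sex as 1.0 for female and 0.0 for male (male is the reference group)
sex = map(s -> strip(string(s)) == "F" ? 1.0 : 0.0, ped[:, 5])
```

To read from disk instead, pass the file name in place of the `IOBuffer`.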

In [4]:
Ped28_1 = readcsv("Ped28trait1.out", Any; header = false)

212×9 Array{Any,2}:
     1     16       "          "  …  "          "  38.8938   90.1804  ""
     1   8228       "          "     "          "  47.0274  109.327   ""
     1  17008       "          "     "          "  44.1152  114.15    ""
     1   9218  17008                 "          "  43.0108  109.922   ""
     1   3226   9218                 "          "  42.4626  100.127   ""
     2     29       "          "  …  "          "  40.5792   95.7373  ""
     2   2294       "          "     "          "  40.5459  108.733   ""
     2   3416       "          "     "          "  46.5575  121.181   ""
     2   3916   2294                 "          "  43.2133  101.566   ""
     2   6790   2294                 "          "  45.1688  114.688   ""
     2  14695   2294              …  "          "  39.986    96.3424  ""
     2  17893   2294                 "          "  42.3638   99.7993  ""
     2   6952   3416                 "          "  50.7288  129.112   ""
     ⋮                         

We place the exposure data in the array exp_1.

In [6]:
exp_1 = convert(Array{Float64,1}, Ped28_1[:,7])

212-element Array{Float64,1}:
 38.8938
 47.0274
 44.1152
 43.0108
 42.4626
 40.5792
 40.5459
 46.5575
 43.2133
 45.1688
 39.986 
 42.3638
 50.7288
  ⋮     
 44.803 
 38.4925
 38.9816
 36.289 
 39.9046
 42.3719
 46.6342
 41.7128
 44.5177
 47.2557
 47.1437
 45.9784

We don't need to retain the IDs so we retrieve the phenotype data and put them in an array Y_1.

In [20]:
Y_1 = convert(Array{Float64,1}, Ped28_1[:,8])

212-element Array{Float64,1}:
  90.1804
 109.327 
 114.15  
 109.922 
 100.127 
  95.7373
 108.733 
 121.181 
 101.566 
 114.688 
  96.3424
  99.7993
 129.112 
   ⋮     
 116.241 
  92.6177
  91.6206
  89.2516
 110.54  
 113.432 
 115.746 
 104.298 
 116.291 
 119.839 
 114.207 
 114.946 

Retrieve sex data coded as 0 (male) or 1 (female) so male is the reference group.

In [7]:
sex_1 = map(x -> strip(x) == "F" ? 1.0 : 0.0, Ped28_1[:, 5])

212-element Array{Float64,1}:
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 0.0
 1.0
 1.0
 0.0
 ⋮  
 0.0
 1.0
 1.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

Take a look at the first 10 lines of the SNP definition file. Notice that the first two lines are formatting header lines; they must be skipped when the file is read.

In [8]:
;head SNP_def28trait1.out

    3.00  = FILE FORMAT VERSION NUMBER.
 
rs3020701       ,19,           90974,   1,   2
rs56343121      ,19,           91106,   1,   2
rs143501051     ,19,           93542,   1,   2
rs56182540      ,19,           95981,   1,   2
rs7260412       ,19,          105021,   1,   2
rs11669393      ,19,          107866,   1,   2
rs181646587     ,19,          107894,   1,   2
rs8106297       ,19,          107958,   1,   2


In [10]:
snpdef28_1 = readcsv("SNP_def28trait1.out", Any; skipstart = 2, header = false)

253141×6 Array{Any,2}:
 "rs3020701       "  19     90974  1  2  ""
 "rs56343121      "  19     91106  1  2  ""
 "rs143501051     "  19     93542  1  2  ""
 "rs56182540      "  19     95981  1  2  ""
 "rs7260412       "  19    105021  1  2  ""
 "rs11669393      "  19    107866  1  2  ""
 "rs181646587     "  19    107894  1  2  ""
 "rs8106297       "  19    107958  1  2  ""
 "rs8106302       "  19    107962  1  2  ""
 "rs183568620     "  19    107987  1  2  ""
 "rs186451972     "  19    108003  1  2  ""
 "rs189699222     "  19    108032  1  2  ""
 "rs182902214     "  19    108090  1  2  ""
 ⋮                                       ⋮ 
 "rs188169422     "  19  59116080  1  2  ""
 "rs144587467     "  19  59117729  1  2  ""
 "rs139879509     "  19  59117949  1  2  ""
 "rs143250448     "  19  59117982  1  2  ""
 "rs145384750     "  19  59118028  1  2  ""
 "rs149215836     "  19  59118040  1  2  ""
 "rs139221927     "  19  59118044  1  2  ""
 "rs181848453     "  19  59118114  1  2  ""
 "rs13831

In this example we analyze a single SNP, rs11672206, so we do not need the base-pair positions of the SNPs; we retrieve only the SNP IDs.

In [11]:
snpid = map(x -> strip(string(x)), snpdef28_1[:, 1])

253141-element Array{AbstractString,1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

Read in the SNP binary file using the SnpArrays.jl package.

Because the SnpArray function requires an input file name ending in .bed rather than .bin, we create a symbolic link SNP_data28d_trait1.bed to SNP_data28d_trait1.bin. (If you have trouble getting this command to work on your computer, you can copy the file outside of Julia.)
## NOTE:
It is VERY important that the columns of the SNP data are in the same order as the SNPs in the SNP definition file and that the rows are in the same order as the individuals in the pedigree file. If they are not, you will get wrong answers without any error message!

In [12]:
;ln -s ./SNP_data28d_trait1.bin ./SNP_data28d_trait1.bed

ln: ./SNP_data28d_trait1.bed: File exists


In [13]:
;ls

Control28bivariate.in
Control_exposure1.in
Control_exposure3.in
Control_trait1.in
Control_trait3.in
Def28e2.in
Def28exp1.out
Def28exp3.out
Def28exposure.in
Def28exptrait.out
Def28exptrait2.out
Def28trait.out
Def28trait1.out
Def28trait3.out
Mendel28exposure1.out
Mendel28exposure3.out
Mendel28exptrait2.out
Mendel28trait1.out
Mendel28trait3.out
MendelianRandomization_LMM.ipynb
MendelianRandomization_VCM_JanetSarah3152018.ipynb
MendelianRandomization_VCM_JanetSarahMarch6.ipynb
Ped28d.in
Ped28d2.in
Ped28exp1.out
Ped28exp3.out
Ped28exptrait.out
Ped28exptrait2.out
Ped28trait1.out
Ped28trait3.out
SNP_data28d.bin
SNP_data28d2.bin
SNP_data28d_exp1.bin
SNP_data28d_exp3.bin
SNP_data28d_trait1.bed
SNP_data28d_trait1.bin
SNP_data28d_trait3.bed
SNP_data28d_trait3.bin
SNP_data28exptrait.bin
SNP_data28exptrait2.bed
SNP_data28exptrait2.bin
SNP_def28d.in
SNP_def28d2.in
SNP_def28exp1.out
SNP_def28exp3.out
SNP_def28exptrait.out
SNP_def28exptrait2.out
SNP_def28trait1.out
SNP_def28trait3.out
Simulation28exp.

In [14]:
using SnpArrays
snpbin28_1 = SnpArray("SNP_data28d_trait1"; people = size(Ped28_1, 1), snps = size(snpdef28_1, 1))

INFO: v1.0 BED file detected

212×253141 SnpArrays.SnpArray{2}:
 (true, true)  (true, true)   …  (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)   …  (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (true, true)      (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (true, true)      (false, false)  (true, true)  
 (true, true)  (false, true)  …  (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 ⋮                            ⋱                  ⋮             
 (true, true)  (false, true)  …  (true, true)    (false, false)
 (true

## Kinship via Genetic Relationship Matrix (GRM)

Recall that in using variance components (linear mixed models) we need a measure of the relatedness among individuals. Under the GRM formulation, the estimate of the global kinship coefficient of individuals $i$ and $j$ is
$$\widehat\Phi_{GRM,ij} = \frac{1}{2S} \sum_{k=1}^S \frac{(x_{ik} - 2p_k)(x_{jk} - 2p_k)}{2 p_k (1-p_k)},$$
where $k$ ranges over the selected $S$ SNPs, $p_k$ is the minor allele frequency of SNP $k$, and $x_{ik}$ is the number of minor alleles in individual $i$'s genotype at SNP $k$.
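
To make the formula concrete, here is a minimal sketch, written for Julia 1.x, that evaluates it on a tiny made-up dosage matrix. The function name `toy_grm`, its `minmaf` keyword, and the data are hypothetical; the `grm` function from SnpArrays used in this notebook is the tool to use on real data. Allele frequencies are estimated from the sample itself, which is an assumption of this sketch.

```julia
using LinearAlgebra

# Toy GRM from a dosage matrix x (rows = individuals, columns = SNPs),
# where x[i, k] counts copies of one allele of SNP k in individual i.
function toy_grm(x::AbstractMatrix{<:Real}; minmaf::Real = 0.01)
    n, s = size(x)
    Φ = zeros(n, n)
    used = 0
    for k in 1:s
        p = sum(x[:, k]) / (2n)               # sample frequency of the counted allele
        min(p, 1 - p) < minmaf && continue    # skip rare or monomorphic SNPs
        z = (x[:, k] .- 2p) ./ sqrt(2p * (1 - p))   # standardized dosages
        Φ .+= z .* z'                         # adds z[i] * z[j] to every entry (i, j)
        used += 1
    end
    return Φ ./ (2 * used)                    # average over the SNPs used, then halve
end

# Four individuals, four SNPs (made-up dosages)
x = [0.0 1.0 2.0 1.0;
     1.0 1.0 0.0 2.0;
     2.0 0.0 1.0 1.0;
     1.0 2.0 1.0 0.0]
Φ = toy_grm(x)
```

The result is symmetric, and its diagonal entries estimate each individual's self-kinship, which is why the notebook passes `2GRM` as the additive genetic covariance.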

## Calculate the GRM matrix

By default, `grm` excludes SNPs with maf < 0.01.

In [15]:
Φgrm_snp28_1 = grm(snpbin28_1)

212×212 Array{Float64,2}:
  0.488859      0.00422798    0.00978752   …   0.0187809     0.000152741
  0.00422798    0.515141     -0.0190676       -0.0220944    -0.0162381  
  0.00978752   -0.0190676     0.489156        -0.0143552    -0.00392193 
  0.241927     -0.00231651    0.266756         0.00264766   -0.00334162 
  0.124909      0.265109      0.12103         -0.0118053    -0.00641433 
 -0.0121766    -0.000271784  -0.00316506   …  -0.000629837   0.0103776  
 -0.011988      0.00845464   -0.00950518      -0.0105435     0.000340285
 -0.0134368    -0.00713175   -0.0101011       -0.0115942    -0.00288935 
 -0.0234305    -0.00241126   -0.010413         0.000129339   0.0166282  
 -0.0121471     0.00305141   -0.00754556      -0.0104162     0.00554306 
 -0.0128464     0.010258      0.00110475   …  -0.00424998    0.0124296  
 -0.0111854     0.00423488   -0.0117065       -0.0244977    -0.00698659 
 -0.0124607    -0.00575043   -0.0124951       -0.00573017   -0.0103429  
  ⋮                      

We need to locate SNP rs11672206 and find that it is the 236079th SNP in the file. 

In [17]:
ind_rs11672206 = find(x -> x == "rs11672206", snpid)[1]

236079

In [18]:
snp_rs11672206 = convert(Array{Float64,1}, snpbin28_1[:, ind_rs11672206])

212-element Array{Float64,1}:
 0.0
 2.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮  
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0

## Example 1 MR when there is a direct effect of exposure on trait. 

In [21]:
###### First Example MR####
MendelianRandomization(Y_1, [ones(length(Y_1), 1) sex_1], snp_rs11672206, exp_1, Φgrm_snp28_1)


******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

MR direct effects (SE): 1.9758033020896233(0.4044635920194066), Pvalue is 1.0343059935417672e-6

Exposure effect (SE) is: 2.5214901400747523(0.3269495431167835)

Trait effect (SE) is: 4.981968544946122(0.7891739351684997)



(1.9758033020896233, 0.4044635920194066)

In this example, the mean exposure level is 40 (males have a mean of 43 and females a mean of 37), the IV was simulated to increase the exposure level by 2.5 units per copy of allele 2, and the heritability of the exposure is 67%. The estimated effect of the IV on the exposure is 2.52 (SE = 0.33), which is very close to the simulated value.

We also simulated a direct effect of the exposure on the trait of 2 units per unit of exposure. The MR estimate of the direct effect is very close to this value (1.98, SE = 0.40) and leads to the inference of a direct effect of the exposure on the trait. If we instead regress the trait on the exposure, we get a very similar estimate of 1.95 (SE = 0.06). The MR estimate has a larger standard error because it is obtained by a two-step procedure, so one might think the regression of trait on exposure is preferable. Although it is in this case, it is not when there is confounding (Example 2).
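
As a quick check of the printed MR result, the sketch below (written for Julia 1.x) recomputes the ratio estimate and its standard error from the exposure and trait effects printed above, using the same expressions as the end of `MendelianRandomization`. The helper name `ratio_estimate` is hypothetical and not part of the notebook's module.

```julia
# Ratio estimate and delta-method SE from the two IV regressions:
# traitB / expB, with the variance built from both standard errors.
function ratio_estimate(traitB, traitBse, expB, expBse)
    est = traitB / expB
    se  = sqrt(traitBse^2 / expB^2 + traitB^2 * expBse^2 / expB^4)
    return est, se
end

# Values printed for Example 1 (trait effect, its SE, exposure effect, its SE)
est, se = ratio_estimate(4.981968544946122, 0.7891739351684997,
                         2.5214901400747523, 0.3269495431167835)
println("ratio estimate = ", round(est, digits = 3), ", SE = ", round(se, digits = 3))
```

Both terms under the square root contribute here: the uncertainty in the trait effect and the uncertainty in the exposure effect, which is why the MR standard error is much larger than that of the direct regression.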

In [22]:
###### First Example Exposure Effect on Trait using VCM ####
X = [ones(length(Y_1), 1) sex_1]
exposure = exp_1
outcome = Y_1
ΦGRM = Φgrm_snp28_1
ExposureEffectVCM(outcome, X, exposure, ΦGRM)

(1.9471633422983434, 0.059848806495791294)

# Example 2:  Confounding - No Direct Effect of Exposure on Trait.

In this example we use the Ped28exptrait2.out data file, simulated under the scenario where there is still a strong IV but any association between the trait and the environmental predictor is due to confounding. 

## Mendel Option 28 data

Take a look at the pedigree file. It is in the classic Mendel format with columns :famid, :id, :faid, :moid, :sex, :twin, :exp2, :simtrait2, that is, family ID, person ID, father ID, mother ID, sex as F (female) or M (male), monozygotic twin indicator, exposure 2, and simulated trait 2.

Read in the pedigree file.

In [23]:
# columns are: :famid, :id, :faid, :moid, :sex, :twin, :exp2, :simtrait2
ped28_2 = readcsv("Ped28exptrait2.out", Any; header = false)

212×9 Array{Any,2}:
     1     16       "          "  …  "          "  32.1022  14.7783  ""
     1   8228       "          "     "          "  32.0522  16.6838  ""
     1  17008       "          "     "          "  45.3005  21.6456  ""
     1   9218  17008                 "          "  43.4727  24.2119  ""
     1   3226   9218                 "          "  37.7578  20.511   ""
     2     29       "          "  …  "          "  36.9209  17.0922  ""
     2   2294       "          "     "          "  43.3864  20.414   ""
     2   3416       "          "     "          "  45.7102  23.3037  ""
     2  17893   2294                 "          "  33.4674  15.3641  ""
     2   6952   3416                 "          "  39.5669  21.3321  ""
     2  14695   2294              …  "          "  30.7948  16.3268  ""
     2   6790   2294                 "          "  38.5657  22.479   ""
     2   3916   2294                 "          "  34.3997  17.2739  ""
     ⋮                            ⋱  ⋮      

We don't need to retain the ids so we retrieve the phenotype data and put them in Y_2 and exp_2, to be used as the respective outcome and exposure in example 2.

In [24]:
exp_2 = convert(Vector{Float64}, ped28_2[:, 7])
Y_2 = convert(Vector{Float64}, ped28_2[:, 8])
Exposure_Trait2 = [exp_2 Y_2]

212×2 Array{Float64,2}:
 32.1022  14.7783
 32.0522  16.6838
 45.3005  21.6456
 43.4727  24.2119
 37.7578  20.511 
 36.9209  17.0922
 43.3864  20.414 
 45.7102  23.3037
 33.4674  15.3641
 39.5669  21.3321
 30.7948  16.3268
 38.5657  22.479 
 34.3997  17.2739
  ⋮              
 43.8188  20.6551
 38.916   17.9165
 40.7922  23.3292
 34.847   13.2469
 41.9567  23.4621
 39.7982  20.5745
 39.3295  20.8389
 38.8928  16.6698
 46.2767  26.1484
 42.8913  23.4916
 44.5806  23.3856
 38.9998  22.224 

Retrieve sex data coded as 0 (male) or 1 (female) so male is the reference group.

In [25]:
sex_2 = map(x -> strip(x) == "F" ? 1.0 : 0.0, ped28_2[:, 5])

212-element Array{Float64,1}:
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 0.0
 1.0
 0.0
 1.0
 ⋮  
 0.0
 0.0
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

Read in the SNP definition file, skipping the first 2 lines.

In [26]:
# columns are: :snpid, :chrom, :pos, :allele1, :allele2, :groupname
snpdef28_2 = readcsv("SNP_def28exptrait2.out", Any; skipstart = 2, header = false)

253141×6 Array{Any,2}:
 "rs3020701       "  19     90974  1  2  ""
 "rs56343121      "  19     91106  1  2  ""
 "rs143501051     "  19     93542  1  2  ""
 "rs56182540      "  19     95981  1  2  ""
 "rs7260412       "  19    105021  1  2  ""
 "rs11669393      "  19    107866  1  2  ""
 "rs181646587     "  19    107894  1  2  ""
 "rs8106297       "  19    107958  1  2  ""
 "rs8106302       "  19    107962  1  2  ""
 "rs183568620     "  19    107987  1  2  ""
 "rs186451972     "  19    108003  1  2  ""
 "rs189699222     "  19    108032  1  2  ""
 "rs182902214     "  19    108090  1  2  ""
 ⋮                                       ⋮ 
 "rs188169422     "  19  59116080  1  2  ""
 "rs144587467     "  19  59117729  1  2  ""
 "rs139879509     "  19  59117949  1  2  ""
 "rs143250448     "  19  59117982  1  2  ""
 "rs145384750     "  19  59118028  1  2  ""
 "rs149215836     "  19  59118040  1  2  ""
 "rs139221927     "  19  59118044  1  2  ""
 "rs181848453     "  19  59118114  1  2  ""
 "rs13831

We will be analyzing a single SNP (rather than taking LD into account), so we do not need the SNP positions, only the SNP IDs.

In [27]:
snpid_2 = map(x -> strip(string(x)), snpdef28_2[:, 1])

253141-element Array{AbstractString,1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

Read in the SNP binary file using the SnpArrays.jl package.
#### Again, it is very important to use the bed file that corresponds to the pedigree file and the SNP definition file.

In [28]:
;ln -s ./SNP_data28exptrait2.bin ./SNP_data28exptrait2.bed

ln: ./SNP_data28exptrait2.bed: File exists


In [29]:
snpbin28_2 = SnpArray("SNP_data28exptrait2.bed"; people = size(ped28_2, 1), snps = size(snpdef28_2, 1))

INFO: v1.0 BED file detected

212×253141 SnpArrays.SnpArray{2}:
 (true, true)  (true, true)   …  (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)   …  (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (true, true)      (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (false, true)  …  (false, false)  (true, true)  
 (true, true)  (true, true)      (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 ⋮                            ⋱                  ⋮             
 (true, true)  (true, true)   …  (true, true)    (false, false)
 (true

In [30]:
Φgrm_snp28_2 = grm(snpbin28_2)

212×212 Array{Float64,2}:
  0.488859      0.00422798    0.00978752   …   0.0187809     0.000152741
  0.00422798    0.515141     -0.0190676       -0.0220944    -0.0162381  
  0.00978752   -0.0190676     0.489156        -0.0143552    -0.00392193 
  0.241927     -0.00231651    0.266756         0.00264766   -0.00334162 
  0.124909      0.265109      0.12103         -0.0118053    -0.00641433 
 -0.0121766    -0.000271784  -0.00316506   …  -0.000629837   0.0103776  
 -0.011988      0.00845464   -0.00950518      -0.0105435     0.000340285
 -0.0134368    -0.00713175   -0.0101011       -0.0115942    -0.00288935 
 -0.0111854     0.00423488   -0.0117065       -0.0244977    -0.00698659 
 -0.0124607    -0.00575043   -0.0124951       -0.00573017   -0.0103429  
 -0.0128464     0.010258      0.00110475   …  -0.00424998    0.0124296  
 -0.0121471     0.00305141   -0.00754556      -0.0104162     0.00554306 
 -0.0234305    -0.00241126   -0.010413         0.000129339   0.0166282  
  ⋮                      

In [32]:
ind_rs11672206_2 = find(x -> x == "rs11672206", snpid_2)[1]

236079

In [33]:
snp_rs11672206_2 = convert(Array{Float64,1}, snpbin28_2[:, ind_rs11672206_2])

212-element Array{Float64,1}:
 0.0
 2.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮  
 0.0
 1.0
 0.0
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0

## Example 2 Mendelian Randomization VCM:

In [34]:
#Example 2
MendelianRandomization(Y_2, [ones(length(Y_2), 1) sex_2], snp_rs11672206_2, exp_2, Φgrm_snp28_2)

MR direct effects (SE): 0.24374813445595547(0.1523385536136043), Pvalue is 0.10958919289246363

Exposure effect (SE) is: -1.8980279594308567(0.25670960108586244)

Trait effect (SE) is: -0.46264077425651523(0.2822910952147574)



(0.24374813445595547, 0.1523385536136043)

Again we simulated the effect of the IV on the exposure to be strong. In this case, each copy of allele 2 of the SNP reduces the exposure by 1.5 units. We simulated the trait and exposure so that they are correlated, but the correlation is due to unmeasured confounders rather than to a direct effect.


In [35]:
###### Second Example Exposure Effect on Trait using VCM ####
X = [ones(length(Y_2), 1) sex_2]
exposure = exp_2
outcome = Y_2
ΦGRM = Φgrm_snp28_2
ExposureEffectVCM(outcome, X, exposure, ΦGRM)

(0.4994751780732781, 0.05399663242249682)

Regressing the trait on the exposure in this case (estimate 0.50, SE = 0.05) would lead a researcher to think that the exposure affects the trait. In fact, the MR analysis (estimate 0.24, SE = 0.15, p = 0.11) shows little evidence of a direct effect.
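
The contrast between the two estimates can be reproduced in a small simulation. The sketch below, written for Julia 1.x, is a deliberately simplified illustration: it uses unrelated individuals (so ordinary least squares slopes replace the variance component fits), and every parameter value is made up rather than taken from the simulation behind this notebook. The trait depends only on an unmeasured confounder `u`, never on the exposure.

```julia
using Random, Statistics

# Slope from a simple linear regression of y on x (with intercept)
slope(x, y) = cov(x, y) / var(x)

rng = MersenneTwister(2018)
n = 200_000

# Instrument: allele count 0, 1, or 2 with allele frequency 0.3
z = Float64.(rand(rng, n) .< 0.3) .+ Float64.(rand(rng, n) .< 0.3)
u = randn(rng, n)                                   # unmeasured confounder
x = 40 .- 1.5 .* z .+ 2 .* u .+ randn(rng, n)       # exposure: IV + confounder + noise
y = 20 .+ 1.5 .* u .+ randn(rng, n)                 # trait: confounder + noise, no exposure effect

naive = slope(x, y)                   # regression of trait on exposure, biased by u
ratio = slope(z, y) / slope(z, x)     # ratio estimate using the IV

println("naive slope = ", round(naive, digits = 3))
println("IV ratio    = ", round(ratio, digits = 3))
```

Under these assumptions the naive slope is clearly positive even though the true effect is zero, while the ratio estimate stays near zero because the instrument is independent of the confounder.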

# Example 3: Direct effect of exposure on trait exists but the IV is weak.

In this example we use the Ped28trait3.out data file, simulated under the scenario where the genetic marker is a weak IV to demonstrate the problem of violating one of the essential assumptions of MR.
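
One rough way to gauge instrument strength is the squared Wald statistic for the IV-exposure association, `(expB / expBse)^2`, which for a single instrument plays the role of a first-stage F statistic. A widely quoted rule of thumb treats values below about 10 as a sign of a weak instrument. The sketch below, written for Julia 1.x, applies this to the exposure effects printed in Examples 1 and 2 for comparison; the helper name `iv_strength` is hypothetical and not part of the notebook's module.

```julia
# Squared Wald statistic for the IV-exposure association
iv_strength(expB, expBse) = (expB / expBse)^2

# Exposure effects and standard errors printed by MendelianRandomization
ex1 = iv_strength(2.5214901400747523, 0.3269495431167835)     # Example 1
ex2 = iv_strength(-1.8980279594308567, 0.25670960108586244)   # Example 2

println("Example 1 IV strength: ", round(ex1, digits = 1))
println("Example 2 IV strength: ", round(ex2, digits = 1))
```

Both values are well above 10, consistent with the strong instruments simulated in the first two examples.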

In [45]:
ped28_3 = readcsv("Ped28trait3.out", Any; header = false)

212×9 Array{Any,2}:
     1     16       "          "  …  "          "  40.1878  214.434  ""
     1   8228       "          "     "          "  47.0623  250.469  ""
     1  17008       "          "     "          "  44.8772  249.917  ""
     1   9218  17008                 "          "  41.4107  231.419  ""
     1   3226   9218                 "          "  40.3835  217.034  ""
     2     29       "          "  …  "          "  39.6165  212.84   ""
     2   2294       "          "     "          "  38.8691  220.87   ""
     2   3416       "          "     "          "  48.9882  271.711  ""
     2   3916   2294                 "          "  45.129   240.725  ""
     2   6790   2294                 "          "  43.2185  240.717  ""
     2  14695   2294              …  "          "  38.846   210.021  ""
     2  17893   2294                 "          "  43.9826  234.954  ""
     2   6952   3416                 "          "  50.7849  280.457  ""
     ⋮                            ⋱  ⋮      

In [46]:
exp_3 = convert(Array{Float64,1}, ped28_3[:, 7])

212-element Array{Float64,1}:
 40.1878
 47.0623
 44.8772
 41.4107
 40.3835
 39.6165
 38.8691
 48.9882
 45.129 
 43.2185
 38.846 
 43.9826
 50.7849
  ⋮     
 41.1275
 38.5637
 35.6761
 32.7235
 37.7896
 38.5346
 45.7092
 37.4253
 45.5548
 46.7553
 49.9749
 44.6052

In [47]:
Y_3 = convert(Array{Float64,1}, ped28_3[:, 8])

212-element Array{Float64,1}:
 214.434
 250.469
 249.917
 231.419
 217.034
 212.84 
 220.87 
 271.711
 240.725
 240.717
 210.021
 234.954
 280.457
   ⋮    
 231.582
 208.184
 192.605
 179.584
 217.257
 219.802
 252.089
 209.743
 254.076
 258.965
 271.941
 246.865

In [48]:
sex_3 = map(x -> strip(x) == "F" ? 1.0 : 0.0, ped28_3[:, 5])

212-element Array{Float64,1}:
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 0.0
 1.0
 1.0
 0.0
 ⋮  
 0.0
 1.0
 1.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

In [49]:
snpdef28_3 = readcsv("SNP_def28trait3.out", Any; skipstart = 2, header = false)

253141×6 Array{Any,2}:
 "rs3020701       "  19     90974  1  2  ""
 "rs56343121      "  19     91106  1  2  ""
 "rs143501051     "  19     93542  1  2  ""
 "rs56182540      "  19     95981  1  2  ""
 "rs7260412       "  19    105021  1  2  ""
 "rs11669393      "  19    107866  1  2  ""
 "rs181646587     "  19    107894  1  2  ""
 "rs8106297       "  19    107958  1  2  ""
 "rs8106302       "  19    107962  1  2  ""
 "rs183568620     "  19    107987  1  2  ""
 "rs186451972     "  19    108003  1  2  ""
 "rs189699222     "  19    108032  1  2  ""
 "rs182902214     "  19    108090  1  2  ""
 ⋮                                       ⋮ 
 "rs188169422     "  19  59116080  1  2  ""
 "rs144587467     "  19  59117729  1  2  ""
 "rs139879509     "  19  59117949  1  2  ""
 "rs143250448     "  19  59117982  1  2  ""
 "rs145384750     "  19  59118028  1  2  ""
 "rs149215836     "  19  59118040  1  2  ""
 "rs139221927     "  19  59118044  1  2  ""
 "rs181848453     "  19  59118114  1  2  ""
 "rs13831

In [50]:
snpid = map(x -> strip(string(x)), snpdef28_3[:, 1])

253141-element Array{AbstractString,1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

In [51]:
;ln -s ./SNP_data28d_trait3.bin ./SNP_data28d_trait3.bed

ln: ./SNP_data28d_trait3.bed: File exists


As before, be sure to use the SNP bed file whose sample order matches the fam file and whose SNP order matches the SNP ID list; a mismatch in either ordering can silently produce incorrect results.

In [52]:
snpbin28_3 = SnpArray("SNP_data28d_trait3.bed"; people = size(ped28_3, 1), snps = size(snpdef28_3, 1))

[1m[36mINFO: [39m[22m[36mv1.0 BED file detected
[39m

212×253141 SnpArrays.SnpArray{2}:
 (true, true)  (true, true)   …  (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)      (true, true)    (false, false)
 (true, true)  (true, true)   …  (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (true, true)      (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (true, true)      (false, false)  (true, true)  
 (true, true)  (false, true)  …  (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 (true, true)  (false, true)     (false, false)  (true, true)  
 ⋮                            ⋱                  ⋮             
 (true, true)  (false, true)  …  (true, true)    (false, false)
 (true

In [53]:
Φgrm_snp28_3 = grm(snpbin28_3)

212×212 Array{Float64,2}:
  0.488859      0.00422798    0.00978752   …   0.0187809     0.000152741
  0.00422798    0.515141     -0.0190676       -0.0220944    -0.0162381  
  0.00978752   -0.0190676     0.489156        -0.0143552    -0.00392193 
  0.241927     -0.00231651    0.266756         0.00264766   -0.00334162 
  0.124909      0.265109      0.12103         -0.0118053    -0.00641433 
 -0.0121766    -0.000271784  -0.00316506   …  -0.000629837   0.0103776  
 -0.011988      0.00845464   -0.00950518      -0.0105435     0.000340285
 -0.0134368    -0.00713175   -0.0101011       -0.0115942    -0.00288935 
 -0.0234305    -0.00241126   -0.010413         0.000129339   0.0166282  
 -0.0121471     0.00305141   -0.00754556      -0.0104162     0.00554306 
 -0.0128464     0.010258      0.00110475   …  -0.00424998    0.0124296  
 -0.0111854     0.00423488   -0.0117065       -0.0244977    -0.00698659 
 -0.0124607    -0.00575043   -0.0124951       -0.00573017   -0.0103429  
  ⋮                      

Now, instead of using the true IV (rs11672206), we use rs7255584, a SNP in linkage disequilibrium (LD) with it, as a proxy IV.

In [61]:
ind_rs7255584 = find(x -> x == "rs7255584", snpid)[1]

236082

In [62]:
snp_rs7255584_3 = convert(Array{Float64,1}, snpbin28_3[:, ind_rs7255584])

212-element Array{Float64,1}:
 1.0
 0.0
 0.0
 0.0
 0.0
 1.0
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 1.0
 ⋮  
 0.0
 0.0
 0.0
 0.0
 1.0
 0.0
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
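
How well a proxy SNP tracks the true IV is commonly summarized by the squared correlation (r²) between the two genotype dosage vectors. The following is a minimal sketch of that calculation, not part of the original analysis: the helper `ld_r2` and the toy dosage vectors are hypothetical and made up for illustration, and the code avoids library calls so that it does not depend on the Julia version.

```julia
# Hypothetical helper: squared Pearson correlation between two dosage vectors.
function ld_r2(g1, g2)
    n = length(g1)
    m1 = sum(g1) / n                        # mean dosage of SNP 1
    m2 = sum(g2) / n                        # mean dosage of SNP 2
    cov12 = sum((g1 .- m1) .* (g2 .- m2))   # unnormalized covariance
    var1 = sum((g1 .- m1) .^ 2)             # unnormalized variances
    var2 = sum((g2 .- m2) .^ 2)
    return cov12^2 / (var1 * var2)          # normalizing constants cancel
end

# Toy dosages (0, 1 or 2 copies of the counted allele); NOT the Mendel data.
true_iv = [0.0, 1.0, 2.0, 1.0, 0.0, 2.0, 1.0, 0.0]
proxy   = [0.0, 1.0, 2.0, 0.0, 0.0, 2.0, 1.0, 1.0]

r2 = ld_r2(true_iv, proxy)
println("r^2 between proxy and true IV: ", r2)
```

In the notebook, the two arguments would be the dosage vectors for rs11672206 and rs7255584 extracted from `snpbin28_3`. Because relatives are correlated, the result should be read as a rough descriptive measure rather than a population LD estimate.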

## Example 3 Mendelian Randomization VCM:

In [63]:
MendelianRandomization(Y_3, [ones(length(Y_3), 1) sex_3], snp_rs7255584_3, exp_3, Φgrm_snp28_3)

MR direct effects (SE): 2.4467025863210314(40.916255981332085), Pvalue is 0.9523166681651045

Exposure effect (SE) is: 0.13008389692292568(0.9573567708345674)

Trait effect (SE) is: 0.31827660704004074(4.779415412555627)



(2.4467025863210314, 40.916255981332085)

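Note that the SNP effect on the exposure (the exposure effect) is weak: the estimate, 0.13, is much smaller than its standard error, 0.96. The SNP effect on the trait is also poorly estimated, so the MR estimate is both biased and very imprecise. The true value is a 5.0 unit change in the trait for every 1.0 unit change in the exposure, but the MR estimate of 2.45 is only about half of that, and its standard error (40.9) is so large that the estimate cannot be distinguished from zero (p = 0.95).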
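
The printed numbers are consistent with a ratio (Wald) estimator: the SNP effect on the trait divided by the SNP effect on the exposure, with a first-order delta-method standard error that treats the two estimates as uncorrelated. The sketch below plugs in the values printed above. It is an assumption about how the numbers relate, not a reproduction of the notebook's `MendelianRandomization` function, and all variable names are illustrative.

```julia
# Estimates and standard errors copied from the printed output above.
b_exposure  = 0.13008389692292568   # SNP effect on the exposure
se_exposure = 0.9573567708345674
b_trait     = 0.31827660704004074   # SNP effect on the trait
se_trait    = 4.779415412555627

# Ratio (Wald) estimate of the effect of the exposure on the trait.
mr_estimate = b_trait / b_exposure

# First-order delta-method SE, assuming (for this sketch) that the two
# SNP effect estimates are uncorrelated.
mr_se = sqrt(se_trait^2 / b_exposure^2 +
             b_trait^2 * se_exposure^2 / b_exposure^4)

# Strength of the instrument: z statistic and approximate F statistic
# for the SNP-exposure association.
z_exposure = b_exposure / se_exposure
F_exposure = z_exposure^2

println("MR estimate: ", mr_estimate, " (SE ", mr_se, ")")
println("SNP-exposure z: ", z_exposure, ", approximate F: ", F_exposure)
```

Because the exposure effect appears in the denominator, a value close to zero inflates both the estimate's variability and its standard error. Here the approximate F statistic is well below 1, far under the conventional rule-of-thumb threshold of 10 for a usable instrument.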

## Example 3 Exposure Effect VCM:

In [64]:
# Third example: effect of the exposure on the trait using VCM
X = [ones(length(Y_3), 1) sex_3]   # design matrix: intercept and sex
exposure = exp_3
outcome = Y_3
ΦGRM = Φgrm_snp28_3                # genetic relationship matrix
ExposureEffectVCM(outcome, X, exposure, ΦGRM)

(4.976393615287649, 0.021010620094895217)