# Trait Simulation - Ordinal Multinomial Power

Authors: Sarah Ji, Janet Sinsheimer, Kenneth Lange, Hua Zhou

In this notebook we illustrate how the `TraitSimulation.jl` package can easily simulate traits from genotype data, all within the OpenMendel universe. Operating within this universe brings potential advantages over other available software when the simulated traits are needed for downstream analysis or study design.

Using just a few calls on the command line to the appropriate packages within OpenMendel, we demonstrate in three easy examples the utilities of the TraitSimulation.jl package.


## Background

There is a lack of software available to geneticists who wish to calculate power and sample sizes when designing a study on genetic data. Typically, study power depends on assumptions about the underlying disease model. Many power calculation tools operate as a black box and do not allow for customization. To develop custom tests, researchers can write their own simulation procedures to carry out power calculations. One limitation of many existing methods for simulating traits conditional on genotypes is that they are restricted to normally distributed traits and to fixed effects.

This software package, TraitSimulation.jl, addresses the need for simulated trait data in genetic analyses. It generates data sets that allow researchers to accurately check the validity of programs and to calculate power for their proposed studies. It gives users the ability to easily simulate phenotypic traits under generalized linear models (GLMs) or variance component models (VCMs) conditional on PLINK formatted genotype data [3]. In addition, we include customized simulation utilities that accompany specific genetic analysis options in OpenMendel; for example, ordered multinomial traits. We demonstrate these simulation utilities on the example dataset described below.


## Demonstration

##### Example Data

We use the OpenMendel package [SnpArrays.jl](https://openmendel.github.io/SnpArrays.jl/latest/) to both read in and write out PLINK formatted files. 

Based on several different features in the EHR including diabetes diagnostic codes, diabetes medication, hyperglycemia in blood results defined by HbA1c and fasting glucose levels, and presence of diabetes process of care codes, the algorithm categorizes individuals into different categories that relate to how likely they are to have diabetes. 

For convenience we use the common assumption that the residual covariance between two relatives can be captured by the additive genetic variance times twice the kinship coefficient.
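Written out, this assumption takes the following form, where $\Phi_{ij}$ denotes the kinship coefficient between individuals $i$ and $j$, $\sigma_a^2$ the additive genetic variance, and $\epsilon_i$ the residual of individual $i$ (the symbols are our own notation for illustration, not taken from the package):

$$\operatorname{Cov}(\epsilon_i, \epsilon_j) = 2\,\Phi_{ij}\,\sigma_a^2, \qquad i \neq j.$$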

In each example the user can specify the simulation model parameters, along with the number of repetitions for each simulation model as desired. By default, the simulation returns the result of a single simulation.


### Double check that you are using Julia version 1.0 or higher by checking the machine information

In [1]:
versioninfo()

Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


In [2]:
using Random, Plots, DataFrames, LinearAlgebra, StatsFuns, CSV
using SnpArrays, TraitSimulation, GLM, StatsBase, OrdinalMultinomialModels
Random.seed!(1234);

┌ Info: Precompiling TraitSimulation [dec3038e-29bc-11e9-2207-9f3d5855a202]
└ @ Base loading.jl:1273


# Reading genotype data using SnpArrays

First use `SnpArrays.jl` to read in the genotype data. We use PLINK formatted data with the same prefixes for the .bim, .fam, .bed files.

SnpArrays is a very useful utility and can do a lot more than just read in the data. More information about all the functionality of SnpArrays can be found at:
https://openmendel.github.io/SnpArrays.jl/latest/

Because missing genotypes often arise from problems making the calls, the called genotypes at a marker with too many missing genotypes are potentially unreliable. By default, SnpArrays filters to keep only genotypes with success rates greater than 0.98 and a minor allele frequency of at least 0.01. To change the stringency, adjust the thresholds passed to `filter` as described in the [SnpArrays](https://openmendel.github.io/SnpArrays.jl/latest/#Fitering-1) documentation.
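As a rough illustration of what these two thresholds measure, the sketch below computes the per-SNP genotyping success rate and minor allele frequency on a small toy dosage matrix using only Base Julia. It is not the SnpArrays implementation: the helper names `success_rate` and `minor_allele_freq` and the toy matrix `G` are hypothetical, and the cutoffs simply mirror the defaults quoted above.

```julia
# Toy dosage matrix (people × SNPs): entries count copies of allele A2,
# and `missing` marks a failed genotype call.
G = [0 1 missing;
     1 2 0;
     2 missing 0;
     1 1 0;
     0 2 missing]

# Fraction of non-missing genotype calls in one SNP column (hypothetical helper).
success_rate(col) = count(!ismissing, col) / length(col)

# Minor allele frequency computed from the non-missing dosages (hypothetical helper).
function minor_allele_freq(col)
    calls = collect(skipmissing(col))
    isempty(calls) && return NaN
    p = sum(calls) / (2 * length(calls))   # frequency of allele A2
    return min(p, 1 - p)
end

rates = [success_rate(G[:, j]) for j in 1:size(G, 2)]
mafs  = [minor_allele_freq(G[:, j]) for j in 1:size(G, 2)]

# Keep SNPs that pass both thresholds.
keep = (rates .> 0.98) .& (mafs .>= 0.01)
```

In this toy example only the first SNP is fully genotyped, so it is the only one that survives the success-rate cutoff.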

In [3]:
filename = "/mnt/UKBiobank/ukbdata/ordinalanalysis/ukb.plink.filtered"
full_snps = SnpArray(filename * ".bed");

In [11]:
full_snp_data = SnpData(SnpArrays.datadir(filename))

SnpData(people: 185565, snps: 470228,
snp_info: 
│ Row │ chromosome │ snpid       │ genetic_distance │ position │ allele1      │ allele2      │
│     │ String     │ String      │ Float64          │ Int64    │ Categorical… │ Categorical… │
├─────┼────────────┼─────────────┼──────────────────┼──────────┼──────────────┼──────────────┤
│ 1   │ 1          │ rs3131972   │ 0.0              │ 752721   │ A            │ G            │
│ 2   │ 1          │ rs12184325  │ 0.0              │ 754105   │ T            │ C            │
│ 3   │ 1          │ rs3131962   │ 0.0              │ 756604   │ A            │ G            │
│ 4   │ 1          │ rs12562034  │ 0.0              │ 768448   │ A            │ G            │
│ 5   │ 1          │ rs116390263 │ 0.0              │ 772927   │ T            │ C            │
│ 6   │ 1          │ rs4040617   │ 0.0              │ 779322   │ G            │ A            │
…,
person_info: 
│ Row │ fid       │ iid       │ father    │ mother    │ sex       │ phenotype │

In [107]:
snpid = full_snp_data.snp_info[!, :snpid][:] # store the snp_info with the snp names
causal_snp_index = findall(x -> x == "rs11240779", snpid) # find the index of the snp of interest by snpid

1-element Array{Int64,1}:
 7

In [110]:
causal_snp = @view full_snps[:, causal_snp_index]

185565×1 view(::SnpArray, :, [7]) with eltype UInt8:
 0x03
 0x03
 0x02
 0x02
 0x03
 0x03
 0x00
 0x03
 0x03
 0x03
 0x00
 0x02
 0x03
    ⋮
 0x02
 0x03
 0x02
 0x03
 0x02
 0x03
 0x03
 0x03
 0x03
 0x02
 0x00
 0x03

In [113]:
maf_cs = maf(@view full_snps[:, causal_snp_index])

1-element Array{Float64,1}:
 0.22473754170848814

In [6]:
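# Generate the effect size of the causal SNP from the chi-squared (df = 1) density evaluated at its MAF.
# maf_cs is a 1-element vector, so take its single entry before calling chisqpdf.
# The 0.807 displayed below corresponds to a MAF of 0.2, the value quoted in the genotype simulation section.
chisq_coeff = round(chisqpdf(1, first(maf_cs)), digits = 3)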

0.807
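For reference, the chi-squared density with one degree of freedom has the closed form $f(x) = e^{-x/2} / \sqrt{2\pi x}$ for $x > 0$. The sketch below evaluates it in Base Julia as a cross-check on `chisqpdf(1, x)`; the function name `chisq1_pdf` is hypothetical and not part of StatsFuns or TraitSimulation.

```julia
# Closed-form density of a chi-squared random variable with 1 degree of freedom.
# `chisq1_pdf` is a hypothetical helper used only for this cross-check.
function chisq1_pdf(x::Real)
    x > 0 || throw(DomainError(x, "this sketch only handles x > 0"))
    return exp(-x / 2) / sqrt(2 * pi * x)
end

# Evaluating at x = 0.2 reproduces the 0.807 shown above.
effect_at_maf_02 = round(chisq1_pdf(0.2), digits = 3)
```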

# Power Calculation

Now we show how to simulate from customized simulation models that accompany specific genetic analysis options in OpenMendel; for example, ordered, multinomial traits and Variance Component Models.


This example illustrates how to use the simulations to generate data sets that allow researchers to accurately check the validity of programs and to calculate power for their proposed studies.

We illustrate this example in three digestible steps:
   1. Simulate genotypes and covariate values representative of our study population.
   2. Carry over the design matrix from step 1 to create the OrderedMultinomialTrait model object.
   3. Simulate from the OrderedMultinomialTrait model object created in step 2 and run the power analyses at the desired significance level.


### `Genotype Simulation:`

Say our study population has a sample size of `n` people and we are interested in studying the effect of a causal SNP with a predetermined minor allele frequency. We use the minor allele frequency of the causal variant to simulate the SnpArray under Hardy-Weinberg equilibrium (HWE) with the `snparray_simulation` function. The simulated genotypes use the PLINK/SnpArray encoding below:
    
    
| Genotype | Plink/SnpArray |  
|:---:|:---:|  
| A1,A1 | 0x00 |  
| missing | 0x01 |
| A1,A2 | 0x02 |  
| A2,A2 | 0x03 |  
    

Given the specified minor allele frequency, `maf`, here `maf = [0.2]`, this function samples from the genotype vector under HWE and returns the compressed binary format used by SnpArrays. If you give the function a vector of minor allele frequencies, for example `maf = [0.2, 0.25, 0.3]`, it simulates a SnpArray under HWE for each specified frequency and outputs them together.
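To make the sampling step concrete, here is a minimal sketch of drawing genotype codes for a single SNP under HWE with Base Julia and the `Random` standard library. It is not the implementation of `snparray_simulation`: the helper `simulate_hwe_codes` is hypothetical, and it assumes that `maf` is the frequency of allele A2 and that no genotypes are missing.

```julia
using Random

# Sample PLINK-style genotype codes for one SNP under Hardy-Weinberg equilibrium.
# Assumption: `maf` is the frequency of allele A2. Hypothetical helper.
function simulate_hwe_codes(rng::AbstractRNG, n::Integer, maf::Real)
    0 <= maf <= 1 || throw(ArgumentError("maf must lie in [0, 1]"))
    p_hom_a1 = (1 - maf)^2            # P(A1,A1)
    p_het    = 2 * maf * (1 - maf)    # P(A1,A2)
    codes = Vector{UInt8}(undef, n)
    for i in 1:n
        u = rand(rng)
        codes[i] = u < p_hom_a1 ? 0x00 : (u < p_hom_a1 + p_het ? 0x02 : 0x03)
    end
    return codes
end

rng = MersenneTwister(2024)
codes = simulate_hwe_codes(rng, 10_000, 0.2)

# Empirical A2 allele frequency: heterozygotes carry one copy, A2 homozygotes two.
n_het = count(c -> c == 0x02, codes)
n_hom = count(c -> c == 0x03, codes)
a2_freq = (n_het + 2 * n_hom) / (2 * length(codes))
```

With 10,000 simulated people the empirical frequency `a2_freq` should land close to the requested 0.2.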

### convert
By default the `convert` function translates genotypes according to the *additive* SNP model, which counts the number of **A2** alleles (0, 1 or 2) per genotype. Other SNP models are *dominant* and *recessive*, both in terms of the **A2** allele.

| Genotype | `SnpArray` | `model=ADDITIVE_MODEL` | `model=DOMINANT_MODEL` | `model=RECESSIVE_MODEL` |    
|:---:|:---:|:---:|:---:|:---:|  
| A1,A1 | 0x00 | 0 | 0 | 0 |  
| missing | 0x01 | NaN | NaN | NaN |
| A1,A2 | 0x02 | 1 | 1 | 0 |  
| A2,A2 | 0x03 | 2 | 1 | 1 |  

If desired, the user can decide to specify alternative model parameters found in the [SnpArrays](https://openmendel.github.io/SnpArrays.jl/latest/#convert-and-copyto!-1) documentation.
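The conversion table above can be written as a small lookup. The sketch below is only an illustration in Base Julia: `decode_genotype` and the symbols `:additive`, `:dominant`, `:recessive` are hypothetical names, whereas SnpArrays itself exposes this behavior through `convert` and its own model constants.

```julia
# Map a PLINK/SnpArray genotype code to a numeric value under a chosen SNP model.
# Hypothetical helper that mirrors the conversion table above.
function decode_genotype(code::UInt8, model::Symbol = :additive)
    code <= 0x03 || throw(ArgumentError("genotype code must be 0x00, 0x01, 0x02 or 0x03"))
    code == 0x01 && return NaN                  # missing genotype
    a2_count = code == 0x00 ? 0 : (code == 0x02 ? 1 : 2)
    if model === :additive
        return Float64(a2_count)                # number of A2 alleles
    elseif model === :dominant
        return a2_count >= 1 ? 1.0 : 0.0        # at least one A2 allele
    elseif model === :recessive
        return a2_count == 2 ? 1.0 : 0.0        # two A2 alleles
    else
        throw(ArgumentError("unknown model $model"))
    end
end

codes = UInt8[0x00, 0x01, 0x02, 0x03]
additive  = decode_genotype.(codes)
dominant  = decode_genotype.(codes, :dominant)
recessive = decode_genotype.(codes, :recessive)
```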

We want to find the index of this causal locus in the snp_definition (.bim) file and then subset that locus from the genetic marker data above.
Using SnpArrays.jl, we then apply `convert` to an `@view` of that column to turn the SnpArray into a `Vector{Float64}` that we can compute with.

In [8]:
locus = convert(Vector{Float64}, @view(full_snps[:, 7]), impute = true)

185565-element Array{Float64,1}:
 2.0
 2.0
 1.0
 1.0
 2.0
 2.0
 0.0
 2.0
 2.0
 2.0
 0.0
 1.0
 2.0
 ⋮  
 1.0
 2.0
 1.0
 2.0
 1.0
 2.0
 2.0
 2.0
 2.0
 1.0
 0.0
 2.0

The published hypertension GWAS analysis includes the following covariates: sex, center, age, age2, BMI, and the top ten principal components to adjust for ancestry/relatedness. 

In [114]:
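# CSV.File piped into DataFrame works across CSV.jl versions; newer releases of CSV.read require an explicit sink
published_covariate_data = DataFrame(CSV.File("/mnt/UKBiobank/ukbdata/ordinalanalysis/Covariate_Final.csv"))
covariates = published_covariate_data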

| Row | FID | IID | sex | center | age | age2 | chip | bmi | hyptens | AveSBP |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|  | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | String | Float64 | Int64 | Float64 |
| 1 | 1000019 | 1000019 | 1 | 11016 | 62 | 3844 | ChipOne | 29.3424 | 0 | 118.0 |
| 2 | 1000078 | 1000078 | 1 | 11009 | 51 | 2601 | ChipOne | 31.0375 | 2 | 134.0 |
| 3 | 1000081 | 1000081 | 1 | 11014 | 47 | 2209 | ChipOne | 23.335 | 2 | 132.5 |
| 4 | 1000105 | 1000105 | 0 | 11010 | 42 | 1764 | ChipOne | 21.5793 | 0 | 104.5 |
| 5 | 1000112 | 1000112 | 1 | 11021 | 55 | 3025 | ChipOne | 26.7327 | 2 | 139.0 |
| 6 | 1000129 | 1000129 | 1 | 11009 | 63 | 3969 | ChipOne | 26.7034 | 3 | 141.0 |
| 7 | 1000141 | 1000141 | 0 | 11008 | 51 | 2601 | ChipOne | 35.1706 | 1 | 128.0 |
| 8 | 1000164 | 1000164 | 0 | 11011 | 45 | 2025 | ChipOne | 25.235 | 1 | 122.0 |
| 9 | 1000224 | 1000224 | 1 | 11010 | 52 | 2704 | ChipOne | 31.6387 | 3 | 158.0 |
| 10 | 1000236 | 1000236 | 1 | 11011 | 46 | 2116 | ChipOne | 22.6394 | 3 | 153.5 |


In [115]:
pcs = covariates[:, 12:21]

| Row | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|  | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0131855 | -0.00507092 | 3.83532e-5 | 0.00318524 | -0.00124501 | -0.00610784 |
| 2 | -0.0103569 | -0.00120657 | -0.00180968 | -0.00439243 | 0.000870917 | 0.00139868 |
| 3 | 0.00190981 | 0.000241873 | 0.00570492 | 0.00268532 | -0.00481082 | 0.00248758 |
| 4 | 0.00338857 | -0.00488271 | 0.00842367 | -0.00717622 | 0.0165991 | 0.00943267 |
| 5 | 0.00532039 | 0.00761606 | -0.00517797 | 0.00159228 | -0.00697205 | -0.000354593 |
| 6 | -0.000972037 | -0.00743359 | -0.00502376 | 0.00480728 | 0.0109095 | -0.00385346 |
| 7 | -0.0129816 | -0.00462359 | 0.010976 | 0.0125487 | -0.00887052 | 0.00243781 |
| 8 | -0.000916772 | -0.00210423 | -0.00842536 | -0.0151277 | -0.00909347 | -0.00166054 |
| 9 | 0.0163709 | -0.0148821 | -0.00619662 | 0.000587472 | 0.00554896 | -0.00301778 |
| 10 | 0.0164032 | -0.00471488 | 0.00896406 | 0.00221101 | 0.00414722 | -0.00437688 |


In [117]:
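# Covariates from the published analysis, converted to numeric vectors
age = Float64.(covariates[!, :age])
sex = Float64.(covariates[!, :sex])
center = Int64.(covariates[!, :center])
bmi = Float64.(covariates[!, :bmi])

# Observed ordinal hypertension outcome
y = Float64.(covariates[!, :hyptens])
# Note that X includes y as its first column, so X_full has 10 PCs + 7 columns = 17 columns,
# matching the 17 coefficients in β_full defined below
X = DataFrame(y = y, sex = sex, center = center, age = age, age2 = age.^2, bmi = bmi, locus = locus)
X_full = Float64.(hcat(pcs, X))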

| Row | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|  | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0131855 | -0.00507092 | 3.83532e-5 | 0.00318524 | -0.00124501 | -0.00610784 |
| 2 | -0.0103569 | -0.00120657 | -0.00180968 | -0.00439243 | 0.000870917 | 0.00139868 |
| 3 | 0.00190981 | 0.000241873 | 0.00570492 | 0.00268532 | -0.00481082 | 0.00248758 |
| 4 | 0.00338857 | -0.00488271 | 0.00842367 | -0.00717622 | 0.0165991 | 0.00943267 |
| 5 | 0.00532039 | 0.00761606 | -0.00517797 | 0.00159228 | -0.00697205 | -0.000354593 |
| 6 | -0.000972037 | -0.00743359 | -0.00502376 | 0.00480728 | 0.0109095 | -0.00385346 |
| 7 | -0.0129816 | -0.00462359 | 0.010976 | 0.0125487 | -0.00887052 | 0.00243781 |
| 8 | -0.000916772 | -0.00210423 | -0.00842536 | -0.0151277 | -0.00909347 | -0.00166054 |
| 9 | 0.0163709 | -0.0148821 | -0.00619662 | 0.000587472 | 0.00554896 | -0.00301778 |
| 10 | 0.0164032 | -0.00471488 | 0.00896406 | 0.00221101 | 0.00414722 | -0.00437688 |


## Phenotype Simulation:

Now that we have our design matrix with the SNP of interest, we can simulate phenotypes from it under different TraitSimulation models. To illustrate, we use the `OrderedMultinomialTrait` model object in TraitSimulation.jl.


### Ordered Multinomial Trait

The ordered multinomial phenotype receives special treatment: the [OrdinalMultinomialModels](https://openmendel.github.io/OrdinalMultinomialModels.jl/stable/#Syntax-1) package provides Julia utilities to fit ordered multinomial models, including the [proportional odds model](https://en.wikipedia.org/wiki/Ordered_logit) and the [ordered probit model](https://en.wikipedia.org/wiki/Ordered_probit) as special cases.
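As a sketch of what simulating from a proportional odds model involves, the code below turns cutpoints `θ` and a linear predictor `η` into category probabilities and draws outcomes from them. It assumes the parameterization $P(Y \le j) = \operatorname{logit}^{-1}(\theta_j - \eta)$ with $\eta = x^T\beta$; the helpers `inv_logit`, `category_probs` and `sample_category` are hypothetical and do not reproduce the internals of TraitSimulation.jl or OrdinalMultinomialModels.jl.

```julia
using Random

# Inverse of the logit link.
inv_logit(z) = 1 / (1 + exp(-z))

# Category probabilities for one observation, assuming
# P(Y <= j) = inv_logit(θ[j] - η) with linear predictor η = x'β.
function category_probs(η::Real, θ::AbstractVector{<:Real})
    issorted(θ) || throw(ArgumentError("cutpoints θ must be increasing"))
    cum = [inv_logit(t - η) for t in θ]     # cumulative probabilities for j = 1, ..., J-1
    return diff(vcat(0.0, cum, 1.0))        # J category probabilities
end

# Draw one ordinal outcome in 1:J by inverting the cumulative distribution.
function sample_category(rng::AbstractRNG, probs::AbstractVector{<:Real})
    u = rand(rng)
    c = 0.0
    for (j, p) in enumerate(probs)
        c += p
        u <= c && return j
    end
    return length(probs)                    # guard against rounding error
end

θ = [1.0, 1.2, 1.4]                         # three cutpoints give four categories
probs = category_probs(0.5, θ)
rng = MersenneTwister(1)
y_sim = [sample_category(rng, probs) for _ in 1:5]
```

Three cutpoints yield four outcome categories, which matches the "number of ordinal multinomial outcome categories: 4" reported for the model object in this notebook.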

In [125]:
link = LogitLink()
θ = [1.0, 1.2, 1.4]
β_non_gen = 
[ 0.168
  0.150 
 -0.060 
  0.160
 -1.386
 -0.118
 -0.231
 -0.555
 -0.727
  0.858
  0.992
  0.004
 -0.001  
 -0.002
  0.001  
  0.003]

β_full = vcat(β_non_gen, chisq_coeff)

17-element Array{Float64,1}:
  0.168
  0.15 
 -0.06 
  0.16 
 -1.386
 -0.118
 -0.231
 -0.555
 -0.727
  0.858
  0.992
  0.004
 -0.001
 -0.002
  0.001
  0.003
  0.807

In [126]:
Ordinal_Model_Test = OrderedMultinomialTrait(Matrix(X_full), β_full, θ, link)

Ordinal Multinomial Model
  * number of fixed effects: 17
  * number of ordinal multinomial outcome categories: 4
  * link function: LogitLink
  * sample size: 185565

### Simulate Ordered Multinomial Logistic

Specific to the ordered multinomial logistic model is the option to transform the multinomial outcome (i.e., 1, 2, 3, 4) into a binary outcome for logistic regression.

The default is the multinomial simulation shown above, but the user can simulate from the transformed logistic outcome by specifying the arguments `Logistic = true` and `threshold = 2`, where `threshold` is the cutoff used to identify cases and controls **(i.e., if y > 2 then y_logit == 1).** If you specify `Logistic = true` without providing a threshold value, the function throws an error to remind you to specify one.
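The thresholding rule itself is a one-liner. The sketch below shows it on a toy outcome vector; `dichotomize` is a hypothetical helper for illustration, not the package's own code path.

```julia
# Collapse an ordinal outcome into a case/control indicator at a cutoff,
# following the rule y > threshold => case (1), otherwise control (0).
dichotomize(y::AbstractVector{<:Integer}, threshold::Integer) = Int.(y .> threshold)

y_ordinal = [1, 2, 3, 4, 2, 3]
y_logit = dichotomize(y_ordinal, 2)   # [0, 0, 1, 1, 0, 1]
```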

## `Power Calculation:`

We use the following function to generate the p-values for the simulated power example under the ordered multinomial regression model. We vary the effect size over the vector γs, which collects effect sizes from 0 to 0.5 in increments of 0.025. As expected, the power increases as the effect size increases.
    

In [128]:
γs = collect(0.0:0.025:0.5)

21-element Array{Float64,1}:
 0.0  
 0.025
 0.05 
 0.075
 0.1  
 0.125
 0.15 
 0.175
 0.2  
 0.225
 0.25 
 0.275
 0.3  
 0.325
 0.35 
 0.375
 0.4  
 0.425
 0.45 
 0.475
 0.5  

The power simulation returns a matrix of p-values: each column corresponds to one effect size in γs, and each row corresponds to one simulation at that effect size. The user supplies the number of simulations, the vector of effect sizes, the TraitSimulation.jl model object, and the random seed.

For GLMTrait objects, the `realistic_power_simulation` function makes the appropriate calls to the GLM.jl package and returns the p-values obtained from testing the significance of the causal locus, using the Wald test by default. Because GLM.jl does not fit every model considered here, we include additional power utilities that call [OrdinalMultinomialModels](https://openmendel.github.io/OrdinalMultinomialModels.jl/stable/#Syntax-1) to obtain the p-value for the causal locus under the ordered multinomial model.
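As a reminder of what a Wald test of the causal locus computes, under the usual large-sample normal approximation the statistic and two-sided p-value take the form below. Here $\hat\beta_{\text{locus}}$ and $\widehat{\operatorname{se}}$ are our own notation for the fitted coefficient and its standard error, and $\Phi$ is the standard normal distribution function; the packages may report an equivalent chi-squared form or use a different test.

$$z = \frac{\hat\beta_{\text{locus}}}{\widehat{\operatorname{se}}(\hat\beta_{\text{locus}})}, \qquad p = 2\,\bigl(1 - \Phi(|z|)\bigr).$$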


For each effect size in `γs`, the corresponding column holds the p-values obtained from testing the significance of the causal locus `nsim = 1000` times under the ordinal multinomial model `Ordinal_Model_Test`, with `randomseed = 1234`.

In [None]:
nsim = 1000
randomseed = 1234
simulated_pvalues = ordinal_power_simulation(nsim, γs, Ordinal_Model_Test, randomseed);
rename!(DataFrame(simulated_pvalues), [Symbol("γs = $(γs[i])") for i in 1:length(γs)])

Now we find the power of each effect size in the user-specified γs vector at the specified alpha level of significance, and plot the trajectory using the Plots.jl package.

In [None]:
α = 0.000005
power_effectsize = power(simulated_pvalues, α)
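Conceptually, the power at each effect size is the share of simulated p-values that fall below α. The sketch below shows that computation on a toy p-value matrix; `empirical_power` is a hypothetical stand-in for illustration and may differ in detail from the package's `power` function (for example, in whether the comparison is strict).

```julia
using Statistics

# Empirical power: for each effect size (column), the share of simulated
# p-values that fall below the significance level α.
empirical_power(pvalues::AbstractMatrix{<:Real}, α::Real) =
    vec(mean(pvalues .< α, dims = 1))

# Toy p-value matrix: 4 simulations (rows) × 3 effect sizes (columns).
toy_pvalues = [0.20  0.030  1e-8;
               0.70  1e-7   1e-9;
               0.04  0.400  1e-6;
               0.90  2e-6   0.02]

toy_power = empirical_power(toy_pvalues, 5e-6)   # [0.0, 0.5, 0.75]
```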

In [None]:
plot(γs, power_effectsize, title = "Multinomial Power", label = "maf = $maf_cs, alpha = $α", lw = 3 , legend = :bottomright, legendfontsize= 9)  # plot power
xlabel!("Detectable Effect Size")
hline!([.8], label = "power = 80%", lw = 3)
vline!([.29], label = "minimum detectable effect size = 0.29")
#savefig("/home/sarahji/TraitSimulation.jl/docs/ordinalmultinomialpower.pdf")
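The minimum detectable effect size marked on the plot can also be read off the power curve programmatically. The sketch below finds the first grid point at which the power reaches a target and interpolates linearly between neighbouring grid points; `min_detectable_effect` is a hypothetical helper, it assumes the curve is evaluated on an increasing grid of effect sizes, and it is shown here on toy values rather than the notebook's results.

```julia
# Smallest effect size at which the power curve reaches `target`, using linear
# interpolation between neighbouring grid points. Returns `nothing` if the
# target is never reached. Hypothetical helper.
function min_detectable_effect(effects::AbstractVector{<:Real},
                               powers::AbstractVector{<:Real},
                               target::Real = 0.8)
    length(effects) == length(powers) || throw(DimensionMismatch("lengths differ"))
    idx = findfirst(p -> p >= target, powers)
    idx === nothing && return nothing
    idx == 1 && return float(effects[1])
    x0, x1 = effects[idx - 1], effects[idx]
    y0, y1 = powers[idx - 1], powers[idx]
    return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
end

toy_effects = [0.0, 0.1, 0.2, 0.3]
toy_powers  = [0.0, 0.2, 0.6, 1.0]
mde = min_detectable_effect(toy_effects, toy_powers)   # ≈ 0.25
```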

## Citations: 

[1] Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570.

[2] OPENMENDEL: a cooperative programming project for statistical genetics. [Hum Genet. 2019 Mar 26. doi: 10.1007/s00439-019-02001-z](https://www.ncbi.nlm.nih.gov/pubmed/?term=OPENMENDEL).

[3] German CA, Sinsheimer JS, Klimentidis YC, Zhou H, Zhou JJ (2019) Ordered multinomial regression for genetic association analysis of ordinal phenotypes at Biobank scale. Genetic Epidemiology 1-13. https://doi.org/10.1002/gepi.22276
