# Vignette of phenotype simulation with genotype

Here we show how to simulate effect sizes and phenotypes from a genotype matrix (taken from real data or simulated).

In [5]:
# Load functions in simxQTL
library(simxQTL)

devtools::load_all("/home/hs3393/pecotmr")

In [8]:
# use read_plink to read PLINK format data
library("MASS")
library("plink2R")
library("tidyverse")
geno <- read_plink("../data/example")

In [9]:
# Common filtering: only keep the variants with missing rate < 0.1 and maf > 0.05
imiss = 0.1
maf = 0.05
# filter_X also removes columns with zero variance (or variance below var_thresh) and imputes missing entries with the column mean
Xmat = filter_X(geno$bed, imiss, maf)
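
For readers who want to see what this kind of filtering involves, here is a minimal base-R sketch. It is not the `filter_X` implementation: `filter_geno_sketch` and its argument names are hypothetical, and the sketch assumes genotypes are coded 0/1/2 with `NA` for missing calls.

```r
# Hypothetical stand-in for the filtering step (not the package's filter_X).
# Assumes X is samples x variants, coded 0/1/2, with NA for missing calls.
filter_geno_sketch <- function(X, missing_rate_thresh, maf_thresh, var_thresh = 0) {
  # Per-variant missing rate
  miss <- colMeans(is.na(X))
  # Frequency of the counted allele, folded to the minor allele
  freq <- colMeans(X, na.rm = TRUE) / 2
  maf_obs <- pmin(freq, 1 - freq)
  keep <- miss < missing_rate_thresh & maf_obs > maf_thresh
  X <- X[, keep, drop = FALSE]
  # Mean-impute the remaining missing entries, column by column
  for (j in seq_len(ncol(X))) {
    na_j <- is.na(X[, j])
    if (any(na_j)) X[na_j, j] <- mean(X[, j], na.rm = TRUE)
  }
  # Drop columns whose variance is at or below var_thresh
  X[, apply(X, 2, var) > var_thresh, drop = FALSE]
}

set.seed(1)
X_toy <- cbind(
  common = rbinom(200, 2, 0.3),   # passes both filters
  rare   = rbinom(200, 2, 0.01),  # fails the MAF filter
  gappy  = rbinom(200, 2, 0.3)    # fails the missing-rate filter (see below)
)
X_toy[1:50, "gappy"] <- NA        # 25% missing
X_toy[1:5, "common"] <- NA        # 2.5% missing, imputed by the column mean
X_filtered <- filter_geno_sketch(X_toy, 0.1, 0.05)
colnames(X_filtered)
```

Only the `common` column survives, and it no longer contains missing values.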

The simulation strategy is given below.

## Total heritability ($\phi_{total}$) for a block: formula derivation

All causal effect sizes are set to 1. The exact value matters little, because $\phi_{total}$ ultimately controls how the variance is split. In this setting all causal SNPs share the same effect size and together explain a fraction $\phi_{total}$ (e.g. 0.5) of the total phenotypic variance. To generate $\textbf{Y}$, we assume a multivariate Gaussian distribution $\textbf{Y} \sim N(\textbf{X} \beta, \sigma^2 \textbf{I})$, where the residual variance $\sigma^2$ follows from the definition of $\phi_{total}$ below.

$\phi_{total} = \dfrac{var(X \beta)}{\sigma^2 + var(X \beta)}$

$\sigma^2 = \frac{var(X \beta)(1-\phi_{total})}{\phi_{total}}$ 
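
As a quick numerical check of the formula above, the following sketch simulates a toy genotype matrix, sets $\sigma^2$ from $\phi_{total}$, and confirms that the realized proportion of variance explained is close to the target. The toy data, the causal positions, and all variable names are assumptions made for illustration only.

```r
set.seed(2024)
n_toy <- 5000
m_toy <- 20
# Toy genotype matrix coded 0/1/2 (illustrative, not the example data)
X_toy <- matrix(rbinom(n_toy * m_toy, 2, 0.3), nrow = n_toy)

beta_toy <- rep(0, m_toy)
beta_toy[c(3, 11)] <- 1          # two causal variants, effect size 1
phi_total <- 0.5

g_toy <- as.vector(X_toy %*% beta_toy)                # genetic component X beta
sigma2 <- var(g_toy) * (1 - phi_total) / phi_total    # residual variance
y_toy <- g_toy + rnorm(n_toy, mean = 0, sd = sqrt(sigma2))

# Realized proportion of variance explained by X beta
h2_realized <- var(g_toy) / var(y_toy)
h2_realized
```

The realized value fluctuates around `phi_total` because the residuals are random draws; the agreement tightens as the sample size grows.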

## SNP-level heritability ($\phi_{SNP}$): formula derivation

1. Assume there are $a$ causal variants in total. We initialize $\beta_1 = 1$.

2. For each causal variant we have:

$\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_2 \beta_2)}{Var(Y)} = ... = \frac{Var(X_a \beta_a)}{Var(Y)} = \phi_{SNP}$

3. Since $Var(Y)$ is common to all terms, this requires $\beta_1^2 Var(X_1) = \beta_2^2Var(X_2) = ... = \beta_a^2 Var(X_a)$, which gives $\beta_2 = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_2)}}$, ..., $\beta_a = \sqrt{\frac{\beta_1^2 Var(X_1)}{Var(X_a)}}$.

4. Then we have
$\frac{Var(X_1 \beta_1)}{Var(Y)} = \frac{Var(X_1 \beta_1)}{Var(X \beta) + \sigma^2} = \phi_{SNP}$. Therefore, $\sigma^2 = \frac{Var(X_1 \beta_1)}{\phi_{SNP}} - Var(X \beta)$.
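
The steps above can be sketched in a few lines of base R. This is an illustration of the derivation on toy data, not the package code; the allele frequencies, the value of $\phi_{SNP}$, and the variable names are assumptions.

```r
set.seed(7)
n_toy <- 2000
# Toy genotypes for a = 3 causal variants with different allele frequencies
X_causal <- cbind(rbinom(n_toy, 2, 0.1),
                  rbinom(n_toy, 2, 0.3),
                  rbinom(n_toy, 2, 0.5))
phi_snp <- 0.05

v <- apply(X_causal, 2, var)          # Var(X_j) for each causal variant
beta_snp <- numeric(ncol(X_causal))
beta_snp[1] <- 1                      # step 1: initialize beta_1 = 1
# Step 3: rescale so that beta_j^2 Var(X_j) = beta_1^2 Var(X_1)
beta_snp[-1] <- sqrt(beta_snp[1]^2 * v[1] / v[-1])

per_snp_var <- beta_snp^2 * v         # identical across causal variants
g_snp <- as.vector(X_causal %*% beta_snp)
# Step 4: residual variance implied by the per-SNP heritability
sigma2_snp <- per_snp_var[1] / phi_snp - var(g_snp)

# Each variant explains phi_snp of the variance Var(X beta) + sigma^2
per_snp_var / (var(g_snp) + sigma2_snp)
```

Note that $\sigma^2$ is positive only when $Var(X_1 \beta_1) / \phi_{SNP}$ exceeds $Var(X \beta)$, so $\phi_{SNP}$ has to be small enough for the chosen number of causal variants.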

## Simulate with the functions in simulate_linreg.R

### Step 1: Get effect sizes

In [14]:
# specify the number of causal variants for each trait
ncausal = 2
# specify the number of traits
ntrait = 2

# is_h2g_total: TRUE or FALSE, selecting one of the two simulation strategies above
# (TRUE: total heritability; FALSE: SNP-level heritability)
# shared_pattern: "all" uses the same causal variants for every trait; "random" draws them at random across traits
shared_pattern = "all"

B = sim_beta(G = Xmat, ncausal = ncausal, ntrait,
             is_h2g_total = TRUE,
             shared_pattern = shared_pattern)

In [15]:
str(B)

 num [1:947, 1:2] 0 0 0 0 0 0 0 0 0 0 ...


B is a matrix of dimension m (number of variants) × n (number of simulated traits), here 947 × 2, whose entries are non-zero only for causal variants.

In [16]:
which(B[,1] != 0)
which(B[,2] != 0)

Because shared_pattern = "all", the same two variants are causal in both trait 1 and trait 2.

In [22]:
# shared_pattern = "random": causal variants are drawn at random across traits
B = sim_beta(G = Xmat, ncausal = ncausal, ntrait,
             is_h2g_total = TRUE,
             shared_pattern = "random")

In [23]:
which(B[,1] != 0)
which(B[,2] != 0)
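
To make the difference between the two patterns concrete, here is a small base-R sketch of how causal positions could be chosen in each case. It is only an illustration: `pick_causal_sketch` is a hypothetical helper, the effect sizes are placeholders, and the actual selection logic inside `sim_beta` may differ.

```r
# Hypothetical helper illustrating the two patterns (not the sim_beta internals)
pick_causal_sketch <- function(m, ncausal, ntrait,
                               shared_pattern = c("all", "random")) {
  shared_pattern <- match.arg(shared_pattern)
  B <- matrix(0, nrow = m, ncol = ntrait)
  if (shared_pattern == "all") {
    idx <- sample.int(m, ncausal)        # one draw, reused for every trait
    B[idx, ] <- 1
  } else {
    for (k in seq_len(ntrait)) {
      B[sample.int(m, ncausal), k] <- 1  # independent draw for each trait
    }
  }
  B
}

set.seed(3)
B_all <- pick_causal_sketch(947, 2, 2, "all")
B_random <- pick_causal_sketch(947, 2, 2, "random")

# Under "all" the causal rows coincide across traits
identical(which(B_all[, 1] != 0), which(B_all[, 2] != 0))
# Under "random" each trait still has ncausal causal variants
colSums(B_random != 0)
```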

**You can also specify the causal variants directly**: if you already know which variants should be causal, pass their indices within this region as causal_index.

In [29]:
# directly assign the causal variant indices

causal_index = c(100, 500)

B = sim_beta_fix_variant(G = Xmat, causal_index = causal_index, is_h2g_total = FALSE)
which(B[,1] != 0)

### Step 2: calculate Y (trait; phenotype)

The function sim_multi_traits simulates one or more traits at once from the effect size matrix B. The simulated phenotypes can be dependent or independent, as controlled by residual_corr.

In [38]:
phenotype = sim_multi_traits(G =  Xmat, B = B, h2g = 0.005, is_h2g_total = FALSE, residual_corr = NULL)
str(phenotype)

List of 2
 $ P           : num [1:489, 1] 14.432 -1.792 3.004 0.935 0.344 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:489] "HG00096:HG00096" "HG00097:HG00097" "HG00099:HG00099" "HG00101:HG00101" ...
  .. ..$ : chr "Trait_1"
 $ residual_var: num [1, 1] 85.7


The output has 2 elements:

1. P: a phenotype matrix with one row per sample and one column per trait (here 489 × 1)
2. residual_var: the residual variance/covariance matrix derived from residual_corr (default NULL, in which case a diagonal correlation matrix is used, so residuals are uncorrelated across traits)
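
The sketch below shows one way such a simulation can be put together in base R: compute the genetic component, turn a residual correlation matrix into a residual covariance matrix, and draw correlated residuals. It is not the `sim_multi_traits` implementation. For simplicity it uses the total-heritability formula, and the toy data, the correlation of 0.6, and all `_toy` names are assumptions.

```r
set.seed(11)
n_toy <- 3000
m_toy <- 10
X_toy <- matrix(rbinom(n_toy * m_toy, 2, 0.3), nrow = n_toy)
B_toy <- matrix(0, nrow = m_toy, ncol = 2)
B_toy[c(2, 7), ] <- 1                 # shared causal variants, toy effect sizes
h2g_toy <- 0.3                        # total heritability of each trait

G_toy <- X_toy %*% B_toy              # genetic component, samples x traits
# Residual standard deviation of each trait from the total-heritability formula
resid_sd <- sqrt(apply(G_toy, 2, var) * (1 - h2g_toy) / h2g_toy)

# Residual correlation between traits; an identity matrix gives independent residuals
residual_corr_toy <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)
residual_var_toy <- diag(resid_sd) %*% residual_corr_toy %*% diag(resid_sd)

# Correlated residuals: standard normal draws times the Cholesky factor
E_toy <- matrix(rnorm(n_toy * 2), nrow = n_toy) %*% chol(residual_var_toy)
P_toy <- G_toy + E_toy
colnames(P_toy) <- paste0("Trait_", 1:2)

cor(E_toy)[1, 2]                      # close to the requested 0.6
var(G_toy[, 1]) / var(P_toy[, 1])     # close to h2g_toy
```

The Cholesky factor is used here only to keep the example free of package dependencies; `MASS::mvrnorm`, which this vignette already loads, is a common alternative.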

In [39]:
phenotype = phenotype$P

X = Xmat
Y = phenotype

### Now we have X (genotype matrix) and Y (phenotype, trait) pairs!

### (Optional) Step 3: Convert them to summary statistics & LD

In [40]:
trait = calculate_sumstat(X, unname(unlist(Y[,1])))

In [41]:
head(trait)

SNP,Beta,se,Freq,p,z
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
rs2773869,-0.414316759,0.7201517,0.194274,0.5650757,-0.575318754
rs2247680,-0.458417826,0.9401589,0.105317,0.6258359,-0.487596127
rs944214,0.632401094,0.6615128,0.2576687,0.3390762,0.955992205
rs944215,-0.287842618,0.7397778,0.1860941,0.6972071,-0.389093344
rs2773870,0.006495443,0.7069575,0.2034765,0.9926692,0.009187882
rs944216,0.006495443,0.7069575,0.2034765,0.9926692,0.009187882


These results come from univariate regression: each variant is regressed against the trait separately, giving one row of summary statistics per variant.
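
For reference, per-variant regression statistics of this kind can be computed in base R as sketched below. `sumstat_sketch` is a hypothetical helper, not the package's `calculate_sumstat`: it assumes an intercept is fitted, takes the p-value from the t distribution, and reports `Freq` as half the mean genotype, any of which may differ from the package.

```r
# Hypothetical base-R version of per-variant marginal regression
sumstat_sketch <- function(X, y) {
  n <- nrow(X)
  yc <- y - mean(y)
  res <- t(apply(X, 2, function(x) {
    xc <- x - mean(x)
    beta <- sum(xc * yc) / sum(xc^2)                # OLS slope with intercept
    resid <- yc - beta * xc
    se <- sqrt(sum(resid^2) / (n - 2) / sum(xc^2))  # standard error of the slope
    z <- beta / se
    c(Beta = beta, se = se, Freq = mean(x) / 2,
      p = 2 * pt(-abs(z), df = n - 2), z = z)
  }))
  data.frame(SNP = colnames(X), res, row.names = NULL)
}

set.seed(5)
X_toy <- matrix(rbinom(300 * 4, 2, 0.3), nrow = 300,
                dimnames = list(NULL, paste0("snp", 1:4)))
y_toy <- 0.5 * X_toy[, 2] + rnorm(300)
ss_toy <- sumstat_sketch(X_toy, y_toy)
ss_toy
```

Each row matches what `summary(lm(y_toy ~ X_toy[, j]))` reports for the slope of variant `j`.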

To obtain the LD matrix, you can simply use cor(X), or a faster alternative:

In [44]:
LD = get_correlation(X)

In [45]:
head(LD)

Unnamed: 0,rs2773869,rs2247680,rs944214,rs944215,rs2773870,rs944216,rs2773871,rs944217,rs944218,rs2416854,⋯,rs75194338,rs76771831,rs2900207,rs2185561,rs4836894,rs4836895,rs4837991,rs4836896,rs4837992,rs3962623
rs2773869,1.0,0.5831338,0.5018285,0.595847,0.6041128,0.6041128,0.6007688,0.6041128,0.6041128,0.3253454,⋯,0.07185063,-0.05350983,0.001825327,0.001825327,0.073952259,-0.06802515,0.07185063,0.073952259,0.07185063,-0.06802515
rs2247680,0.5831338,1.0,0.574926,0.7194841,0.6909864,0.6909864,0.6998237,0.6909864,0.6909864,0.2252858,⋯,-0.004744519,-0.00344168,0.019282826,0.019282826,-0.003234928,-0.01216305,-0.004744519,-0.003234928,-0.004744519,-0.01216305
rs944214,0.5018285,0.574926,1.0,0.8015244,0.8443176,0.8443176,0.8389883,0.8443176,0.8443176,0.5831042,⋯,0.018064229,-0.05804925,0.03316515,0.03316515,0.015614323,-0.04517601,0.018064229,0.015614323,0.018064229,-0.04517601
rs944215,0.595847,0.7194841,0.8015244,1.0,0.9466962,0.9466962,0.9522023,0.9466962,0.9466962,0.7280007,⋯,0.038366569,-0.0245557,0.021031703,0.021031703,0.040447197,-0.05353989,0.038366569,0.040447197,0.038366569,-0.05353989
rs2773870,0.6041128,0.6909864,0.8443176,0.9466962,1.0,1.0,0.9936782,1.0,1.0,0.6803265,⋯,0.04197591,-0.04445553,0.04428367,0.04428367,0.044148854,-0.0768368,0.04197591,0.044148854,0.04197591,-0.0768368
rs944216,0.6041128,0.6909864,0.8443176,0.9466962,1.0,1.0,0.9936782,1.0,1.0,0.6803265,⋯,0.04197591,-0.04445553,0.04428367,0.04428367,0.044148854,-0.0768368,0.04197591,0.044148854,0.04197591,-0.0768368
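
As a cross-check on what the LD matrix contains, the sketch below computes the pairwise correlations by hand: standardize each column, then take a single matrix cross-product. Whether `get_correlation` works this way is not stated in this vignette, so treat this as one common approach on toy data rather than a description of that function.

```r
set.seed(9)
n_toy <- 500
X_toy <- matrix(rbinom(n_toy * 6, 2, 0.3), nrow = n_toy,
                dimnames = list(NULL, paste0("snp", 1:6)))

# Center and scale each column (requires non-zero variance, as ensured by filtering)
Xs_toy <- scale(X_toy)
# One cross-product yields all pairwise correlations
LD_sketch <- crossprod(Xs_toy) / (n_toy - 1)

# Largest absolute difference from the built-in estimator
max(abs(LD_sketch - cor(X_toy)))
```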
