# Notebook

Polygenic risk scores (PRSs) -> predict complex disease risk

larger training sample size -> more accurate PRSs

Standard approach for calculating risk scores: linkage disequilibrium (LD)-based marker pruning and applying a p value threshold to association statistics.

## 1. Introduction

Current PRS methods do not account for the effects of linkage disequilibrium (LD)

Solutions:

* Methods that require genotype data as input (genomic BLUP)

* LDpred, a Bayesian PRS that estimates posterior mean causal effect sizes from GWAS (genome-wide association study) summary statistics by assuming a prior for the genetic architecture and using LD information from a reference panel.

Compare LDpred, P+T and other approaches

## 2. Methods

**LDpred** calculates the **posterior mean effects** from GWAS summary statistics by conditioning on a **genetic *architecture* prior** and **LD information from a reference panel**

1. genetic *architecture* prior

The **prior for effect sizes** is a point-normal mixture distribution, which allows for non-infinitesimal genetic *architectures*. It has 2 parameters:

* the **heritability (parameter) explained by the genotypes**

estimated from GWAS summary statistics and accounts for sampling noise and LD (details

and 

* **the fraction of causal markers** (i.e., the fraction of markers with non-zero effects)

**multiple LDpred risk scores** are calculated with the use of priors with varying fractions of markers with non-zero effects.

2. LD information from a reference panel

3. Estimate the **posterior mean effect sizes** via the **Markov chain Monte Carlo (MCMC)** method and apply them to validation data to **obtain PRSs**


### 2.1 Phenotype Model

$Y$ ($N \times 1$), $X$ ($N \times M$): centered and standardized phenotype vector and genotype matrix

$$
Y=\sum_{i=1}^MX_i\beta_i+\epsilon
$$

$\epsilon$: environmental and noise contribution

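Ideally, the (marginal) least-squares estimate of an individual marker effect is $\hat\beta_i=X_i^\prime Y/N$.

In practice, often only other summary statistics are available, such as the p value and the direction of the effect estimate. The effect is then approximated as $\hat\beta_i=s_i(z_i/\sqrt N)$, where $s_i$ is the sign of the effect and $z_i$ is the absolute standard normal quantile corresponding to the p value.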
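The approximation above can be sketched in a few lines. This is only an illustration: the function name `effect_from_p_value` is hypothetical, and it assumes the reported p value is two-sided, so the quantile is taken at half the p value.

```python
from statistics import NormalDist


def effect_from_p_value(p_value: float, sign: int, n: int) -> float:
    """Approximate a standardized marginal effect as sign * z / sqrt(n).

    Assumption: p_value is two-sided, so |z| is the standard normal
    quantile at p_value / 2.
    """
    z = abs(NormalDist().inv_cdf(p_value / 2.0))
    return sign * z / n ** 0.5


# Hypothetical marker: p = 0.05, negative direction, GWAS sample size 10,000.
beta_hat = effect_from_p_value(p_value=0.05, sign=-1, n=10_000)
print(round(beta_hat, 4))  # -0.0196
```

Because the genotypes and phenotype are standardized, the result is on the per-standard-deviation scale used in the rest of these notes.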

### 2.2 Unadjusted PRS

$$
S_i=\sum_{j=1}^M X_{ij} \hat\beta_j
$$
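As a small illustration of the sum above, the sketch below scores three made-up individuals on two markers. The function name `unadjusted_prs` and all numbers are hypothetical; genotypes are assumed to be standardized already.

```python
def unadjusted_prs(genotypes, beta_hat):
    """Return S_i = sum_j X_ij * beta_hat_j for every individual i."""
    return [sum(x * b for x, b in zip(row, beta_hat)) for row in genotypes]


# Hypothetical standardized genotypes: 3 individuals x 2 markers.
X = [[1.0, -0.5],
     [0.0, 1.5],
     [-1.0, -1.0]]
beta_hat = [0.2, -0.1]  # marginal effect estimates, one per marker

scores = unadjusted_prs(X, beta_hat)
print([round(s, 2) for s in scores])
```

Every marker contributes with its raw marginal estimate here, so correlated markers are counted more than once. The later sections address that problem.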

### 2.3 P+T

Informed LD pruning (LD clumping) followed by p value thresholding (like lasso?).
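A much simplified sketch of the idea, under stated assumptions: markers are visited from most to least significant, a marker is kept only if its $r^2$ with every already-kept marker is below a cutoff, and markers above the p value threshold are dropped. Real clumping tools also restrict comparisons to a physical window, which is ignored here. The function name, the toy $r^2$ matrix, and both thresholds are illustrative, not values taken from these notes.

```python
def prune_and_threshold(p_values, r2, p_threshold, r2_threshold):
    """Return indices of markers kept by simplified clumping + thresholding."""
    kept = []
    # Visit markers from most to least significant.
    for j in sorted(range(len(p_values)), key=lambda k: p_values[k]):
        if p_values[j] > p_threshold:
            break  # every remaining marker is less significant
        # Keep the marker only if it is not in LD with a stronger kept marker.
        if all(r2[j][k] < r2_threshold for k in kept):
            kept.append(j)
    return sorted(kept)


# Toy example: markers 0 and 1 are in strong LD, marker 3 is not significant.
p_values = [1e-8, 1e-6, 0.03, 0.4]
r2 = [[1.0, 0.9, 0.0, 0.0],
      [0.9, 1.0, 0.1, 0.0],
      [0.0, 0.1, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
kept = prune_and_threshold(p_values, r2, p_threshold=0.05, r2_threshold=0.2)
print(kept)  # [0, 2]
```

The PRS is then the unadjusted score restricted to the kept markers, i.e. all other weights are set to zero.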

### 2.4 Bpred: Bayesian Approach in the Special Case of No LD

The quantity of interest is the posterior mean phenotype given the GWAS summary statistics $\tilde{\beta}$ and the LD information $\widehat{D}$ from the GWAS sample, which is a sum over the posterior mean marker effects:

$$
E(Y \mid \tilde{\beta}, \widehat{D})=\sum_{i=1}^{M} X_{i}^{\prime} E\left(\beta_{i} \mid \tilde{\beta}, \widehat{D}\right)
$$

The LD structure of the training (GWAS) sample is usually not available, so the local LD structure in the training data is estimated from the independent validation data.

The variance of the trait:

$$
\operatorname{Var}(Y)=h_{g}^{2} \Theta+\left(1-h_{g}^{2}\right) \mathrm{I}
$$

where $h_{g}^{2}$ denotes the **heritability explained by the genotyped variants**, and $\Theta=X X^{\prime} / M$ is the SNP-based genetic **relationship matrix**. 

We can obtain a trait with the desired covariance structure if we sample the betas independently with mean 0 and variance $h_g^2/M$.

---

**Infinitesimal model**, with the following Gaussian prior:
$$
\beta_{i} \sim_{i i d} N\left(0,\left(h_{g}^{2} / M\right)\right)
$$
Posterior mean:
$$
E\left(\beta_{i} \mid \tilde{\beta}\right)=E\left(\beta_{i} \mid \tilde{\beta}_{i}\right)=\left(\frac{h_{g}^{2}}{h_{g}^{2}+\frac{M}{N}}\right) \tilde{\beta}_{i}
$$
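To get a feel for this shrink factor, the sketch below evaluates $h_g^2/(h_g^2+M/N)$ for a few sample sizes. The function name and the numbers are made up for illustration.

```python
def infinitesimal_shrink(h2: float, m: int, n: int) -> float:
    """Shrink factor h2 / (h2 + M/N) of the infinitesimal posterior mean."""
    return h2 / (h2 + m / n)


# Hypothetical setting: h2 = 0.5 and M = 100,000 unlinked markers.
for n in (10_000, 100_000, 1_000_000):
    print(n, round(infinitesimal_shrink(h2=0.5, m=100_000, n=n), 4))
```

The factor grows toward 1 as $N$ increases relative to $M$, so larger GWAS samples lead to less shrinkage of the marginal estimates.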
Expected squared correlation between the unadjusted PRS (with **unlinked** markers) and the phenotype:
$$
\left(\frac{h_{g}^{2}}{h_{g}^{2}+\frac{M}{N}}\right) h_{g}^{2}
$$

---

**Non-infinitesimal model**, with the following Gaussian mixture prior:
$$
\beta_{i} \sim_{i i d}\left\{\begin{array}{c}
N\left(0, \frac{h_{g}^{2}}{M p}\right) \text { with probability } p \\
0 \text { with probability }(1-p)
\end{array}\right.
$$
where $p$ is the probability that a marker is drawn from a Gaussian distribution, i.e., the fraction of causal markers. 

Posterior mean:
$$
\mathrm{E}\left(\beta_{i} \mid \tilde{\beta}_{i}\right)=\left(\frac{h_{g}^{2}}{h_{g}^{2}+\frac{M p}{N}}\right) \bar{p}_{i} \tilde{\beta}_{i}
$$
where $\bar{p}_{i}$ is the posterior probability that the $i^{\text {th }}$ marker is causal and can be calculated analytically.
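One way to see how $\bar{p}_{i}$ can be computed analytically is to compare the two marginal densities of $\tilde{\beta}_i$ implied by the prior. If the marker is causal, $\tilde{\beta}_i$ has variance $h_g^2/(Mp)+1/N$. Otherwise it has variance $1/N$, which is sampling noise only. The sketch below follows directly from that reasoning. It is not copied from the LDpred source code, and the function names are hypothetical.

```python
import math


def posterior_causal_prob(beta_tilde, h2, m, n, p):
    """Posterior probability that a marker is causal (no-LD case)."""
    var_causal = h2 / (m * p) + 1.0 / n  # prior variance plus sampling noise
    var_null = 1.0 / n                   # sampling noise only
    # The common 1/sqrt(2*pi) factor cancels in the ratio below.
    dens_causal = p * math.exp(-0.5 * beta_tilde ** 2 / var_causal) / math.sqrt(var_causal)
    dens_null = (1.0 - p) * math.exp(-0.5 * beta_tilde ** 2 / var_null) / math.sqrt(var_null)
    return dens_causal / (dens_causal + dens_null)


def posterior_mean_effect(beta_tilde, h2, m, n, p):
    """Posterior mean: shrink factor * causal probability * marginal estimate."""
    shrink = h2 / (h2 + m * p / n)
    return shrink * posterior_causal_prob(beta_tilde, h2, m, n, p) * beta_tilde


# Hypothetical setting: h2 = 0.5, M = N = 1000, 1% of markers causal.
for beta_tilde in (0.0, 0.05, 0.5):
    print(beta_tilde, posterior_mean_effect(beta_tilde, h2=0.5, m=1000, n=1000, p=0.01))
```

Small estimates are pushed almost to zero, while large ones are shrunk only mildly. With $p=1$ the result reduces to the infinitesimal posterior mean.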

### 2.5 LDpred: Bayesian Approach in the Presence of LD

Assuming distant markers are unlinked, the posterior mean for the effect sizes *within a small region l* under an **infinitesimal model** (_LDpred-inf_), with $D_l$ denoting the LD matrix of that region, is approximately
$$
E\left(\beta^{l} \mid \tilde{\beta}^{l}, D\right) \approx\left(\frac{M}{N h_{g}^{2}} I+D_{l}\right)^{-1} \tilde{\beta}^{l}
$$

LDpred-inf is therefore a natural extension of the genomic BLUP to summary statistics.
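The regional formula can be tried out on a toy region with two markers. The sketch below is an assumption-laden illustration, not LDpred's implementation: it solves $\left(\frac{M}{N h_g^2} I + D_l\right) x = \tilde{\beta}^l$ with a small hand-written Gaussian elimination so that it needs no third-party packages (in practice a routine such as `numpy.linalg.solve` would be the usual choice). All names and numbers are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ldpred_inf(beta_tilde, ld, h2, m, n):
    """Posterior mean effects in one region under the infinitesimal model."""
    ridge = m / (n * h2)
    size = len(ld)
    a = [[ld[i][j] + (ridge if i == j else 0.0) for j in range(size)]
         for i in range(size)]
    return solve(a, beta_tilde)


beta_tilde = [0.03, 0.03]              # hypothetical marginal estimates
ld_linked = [[1.0, 0.8], [0.8, 1.0]]   # two markers in strong LD
ld_unlinked = [[1.0, 0.0], [0.0, 1.0]]

print(ldpred_inf(beta_tilde, ld_linked, h2=0.5, m=1000, n=1000))
print(ldpred_inf(beta_tilde, ld_unlinked, h2=0.5, m=1000, n=1000))
```

With the identity LD matrix the weights equal the no-LD shrinkage $h_g^2/(h_g^2+M/N)\,\tilde{\beta}_i$. With linked markers the weights are smaller, because the two marginal estimates partly reflect the same signal.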

---

LDpred approximates the posterior mean under a non-infinitesimal Gaussian mixture prior numerically by using an approximate **MCMC Gibbs sampler** (Appendix A).

Ensure Convergence:

A shrinkage factor $c=\min \left(1,\left(\widehat{h}_{g}^{2} /\left(\tilde{h}_{g}^{2}\right)_{i}\right)\right)$ is used to shrink $\bar{p}_{i}$, where $\hat{h}_{g}^{2}$ is the estimated heritability based on an aggregate approach (see below), and $\left(\tilde{h}_{g}^{2}\right)_{i}$ is the estimated genome-wide heritability at each big iteration.

Practical Considerations:

LDpred has two parameters: the LD radius and the fraction $p$ of non-zero effects in the prior.

When using LDpred, we recommend that SNP weights (posterior mean effect sizes) are calculated for exactly the SNPs used in the validation data (i.e., the intersection of the SNP sets).

### 2.6 Some dataset??? (No idea)

* WTCCC Genotype Data

* Summary Statistics and Independent Validation Datasets

* SCZ Validation Datasets with Non-European Ancestry

### 2.7 Prediction-Accuracy Metrics

For quantitative traits: squared correlation ($R^2$).

For case-control traits: Nagelkerke $R^2$, observed-scale $R^2$, liability-scale $R^2$, and the area under the curve (AUC).
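Two of these metrics are easy to sketch by hand: the squared Pearson correlation, and the AUC computed as the fraction of case-control pairs in which the case has the higher score (ties count one half). Nagelkerke and liability-scale $R^2$ are not covered here. The function names and toy data are hypothetical.

```python
def squared_correlation(scores, phenotypes):
    """Squared Pearson correlation between risk scores and a phenotype."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_y = sum(phenotypes) / n
    cov = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, phenotypes))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_y = sum((y - mean_y) ** 2 for y in phenotypes)
    return cov * cov / (var_s * var_y)


def auc(scores, is_case):
    """Probability that a random case outscores a random control."""
    cases = [s for s, c in zip(scores, is_case) if c]
    controls = [s for s, c in zip(scores, is_case) if not c]
    wins = sum((a > b) + 0.5 * (a == b) for a in cases for b in controls)
    return wins / (len(cases) * len(controls))


# Toy data: four individuals, the last two are cases.
scores = [0.1, 0.4, 0.35, 0.8]
is_case = [0, 0, 1, 1]
print(auc(scores, is_case))  # 0.75
print(squared_correlation(scores, [1.0, 2.5, 2.0, 4.0]))
```

The pairwise AUC loop is quadratic in the sample size, which is fine for a toy example but not for large validation cohorts.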

All of the reported prediction $R^2$ values were adjusted for the top five principal components (PCs) in the validation sample (top three PCs for BC) ????? (No idea)

....