# Proportion of Variance Explained and Heritability

Proportion of variance explained (PVE) measures how much of the total variation in a trait (like height or disease risk) can be attributed to specific variables in your statistical model (e.g., genetic variants). Heritability is a specific application of this concept that measures how much of the variation in a trait across a population can be explained by genetic differences.

# Graphical Summary

![PVE](./graphical_summary/Slide15.png)

# Key Formula

Any phenotype can be modeled as the sum of genetic and environmental effects, i.e., $\text{Phenotype}~(P) = \text{Genotype}~(G) + \text{Environment}~(E)$. Assuming that $G$ and $E$ are independent, the **proportion of variance explained (PVE)** by the genetic effect alone (also called broad-sense heritability, $H^2$) is

$$
\text{PVE} = H^2 = \frac{\text{Var}(G)}{\text{Var}(P)}
$$

where:
- $\text{Var}(G)$ is the genetic variance component
- $\text{Var}(E)$ is the environmental variance component
- $\text{Var}(P) = \text{Var}(G) + \text{Var}(E)$ is the total phenotypic variance

# Technical Details

## Components of Variance

Any phenotype can be modeled as the sum of genetic and environmental effects:

$$\text{Phenotype}~(P) = \text{Genotype}~(G) + \text{Environment}~(E)$$

The phenotypic variance in the trait can then be partitioned as:

$$\text{Var}(P) = \text{Var}(G) + \text{Var}(E) + 2\text{Cov}(G,E)$$

Where:
- $\text{Var}(G)$ is the genetic variance component
- $\text{Var}(E)$ is the environmental variance component
- $\text{Cov}(G,E)$ is the covariance between genetic and environmental effects
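
As a minimal numerical check of this partition, the sketch below simulates hypothetical genetic and environmental values that are deliberately correlated and confirms that the sample variances satisfy the identity. The variable names and parameter values are illustrative assumptions, not taken from a real dataset.

```r
# Illustrative simulation; all values below are assumptions for demonstration
set.seed(1)
n <- 1000
G <- rnorm(n, mean = 0, sd = 1)              # hypothetical genetic values
E <- 0.5 * G + rnorm(n, mean = 0, sd = 1)    # environment correlated with G
P <- G + E                                   # phenotype

var_P   <- var(P)
var_sum <- var(G) + var(E) + 2 * cov(G, E)

# The two quantities agree up to floating-point error
all.equal(var_P, var_sum)
```

Dropping the covariance term here would understate $\text{Var}(P)$, which is why the independence assumption matters for the heritability formulas that follow.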

## Broad-sense Heritability

In controlled experimental settings, studies can be designed so that $\text{Cov}(G,E)$ is negligible and can be treated as zero. In such cases, broad-sense heritability is defined as the proportion of phenotypic variance attributable to all genetic effects:

$$H^2 = \frac{\text{Var}(G)}{\text{Var}(P)}$$

Here $\text{Var}(G)$ counts every genetic contribution to the trait, not only the additive part.
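
To make the definition concrete: when $\text{Cov}(G,E) = 0$, the partition of $\text{Var}(P)$ given earlier reduces to $\text{Var}(G) + \text{Var}(E)$, so $H^2$ can be written in terms of the two components alone:

$$
H^2 = \frac{\text{Var}(G)}{\text{Var}(G) + \text{Var}(E)}
$$

For instance, with the made-up values $\text{Var}(G) = 2$ and $\text{Var}(E) = 3$, this gives $H^2 = 2/5 = 0.4$.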

## Narrow-sense Heritability

**Narrow-sense heritability** ($h^2$) is the proportion of phenotypic variance attributable to **additive** genetic effects only:

$$h^2 = \frac{\text{Var}(A)}{\text{Var}(P)}$$

where $\text{Var}(A)$ is the additive genetic variance, a component of $\text{Var}(G)$.

Other components of $\text{Var}(G)$ include $\text{Var}(D)$ (dominance variance) and $\text{Var}(I)$ (epistatic variance, i.e., gene-gene interaction).
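
The split of $\text{Var}(G)$ into additive and dominance parts can be illustrated with the classical single-locus model. The sketch below assumes one biallelic locus in Hardy-Weinberg equilibrium with genotypic values $+a$, $d$, and $-a$, no epistasis, and $\text{Cov}(G,E) = 0$. The allele frequency `p`, the values `a` and `d`, and the environmental variance `var_E` are arbitrary illustrative choices.

```r
# Single biallelic locus under Hardy-Weinberg equilibrium (illustrative values)
p <- 0.3          # frequency of allele A1 (assumed)
q <- 1 - p
a <- 1            # genotypic value of A1A1 is +a, of A2A2 is -a (assumed)
d <- 0.5          # genotypic value of the heterozygote (assumed)

geno_freq  <- c(p^2, 2 * p * q, q^2)
geno_value <- c(a, d, -a)

# Total genetic variance computed directly from the genotype distribution
mean_G <- sum(geno_freq * geno_value)
var_G  <- sum(geno_freq * (geno_value - mean_G)^2)

# Classical decomposition into additive and dominance variance
alpha <- a + d * (q - p)          # average effect of allele substitution
var_A <- 2 * p * q * alpha^2
var_D <- (2 * p * q * d)^2

var_E <- 1                        # assumed environmental variance
var_P <- var_G + var_E            # relies on Cov(G, E) = 0

H2 <- var_G / var_P               # broad-sense heritability
h2 <- var_A / var_P               # narrow-sense heritability
c(var_G = var_G, var_A = var_A, var_D = var_D, H2 = H2, h2 = h2)
```

Because $\text{Var}(A)$ is only one component of $\text{Var}(G)$, $h^2$ cannot exceed $H^2$ in this setting.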


# Example

This example calculates the PVE of each genetic variant using OLS, first one variant at a time and then in a joint model.

As in the random effect example, the first genetic variant is the true causal variant and its true effect is drawn from the distribution $N(0,1)$. After simulating the data, we fit marginal and joint models and compute the proportion of variance explained by the genetic variants.

- Requirements:
  - [OLS](https://gaow.github.io/statgen-prerequisites/ordinary_least_squares.html)
  - [Summary Statistics](https://gaow.github.io/statgen-prerequisites/summary_statistics.html)
  - [Random Effect](https://gaow.github.io/statgen-prerequisites/random_effect.html)

In [15]:
# Clear the environment
rm(list = ls())
set.seed(12)
# Define genotypes for 5 individuals at 3 variants
# These represent actual alleles at each position
# For example, Individual 1 has genotypes: CC, CT, AT
genotypes <- c(
 "CC", "CT", "AT",  # Individual 1
 "TT", "TT", "AA",  # Individual 2
 "CT", "CT", "AA",  # Individual 3
 "CC", "TT", "AA",  # Individual 4
 "CC", "CC", "TT"   # Individual 5
)
# Reshape into a matrix
N = 5 # number of individuals
M = 3 # number of variants
geno_matrix <- matrix(genotypes, nrow=N, ncol=M, byrow=TRUE)
rownames(geno_matrix) <- paste("Individual", 1:N)
colnames(geno_matrix) <- paste("Variant", 1:M)

alt_alleles <- c("T", "C", "T")

# Convert to a raw genotype matrix under the additive model
Xraw_additive <- matrix(0, nrow=N, ncol=M) # count number of non-reference alleles

rownames(Xraw_additive) <- rownames(geno_matrix)
colnames(Xraw_additive) <- colnames(geno_matrix)

for (i in 1:N) {
  for (j in 1:M) {
    alleles <- strsplit(geno_matrix[i,j], "")[[1]]
    Xraw_additive[i,j] <- sum(alleles == alt_alleles[j])
  }
}
X <- scale(Xraw_additive, center=TRUE, scale=TRUE)
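
The double loop above can also be written per variant rather than per cell. The following is a sketch only: `count_alt` is a hypothetical helper introduced here for illustration, and the small two-individual matrix is a cut-down copy of the genotypes above so that the block runs on its own.

```r
# Hypothetical helper (not part of the original example): count copies of
# `alt` in each two-letter genotype string of a single variant.
count_alt <- function(genos, alt) {
  vapply(strsplit(genos, ""), function(a) sum(a == alt), numeric(1))
}

# First two individuals from the example, three variants
geno <- matrix(c("CC", "CT", "AT",
                 "TT", "TT", "AA"), nrow = 2, byrow = TRUE)
alt  <- c("T", "C", "T")

# One column of alternative-allele counts per variant
dosage <- sapply(seq_len(ncol(geno)), function(j) count_alt(geno[, j], alt[j]))
dosage
```

For a 5 x 3 matrix the loop is perfectly adequate; the per-variant form mainly reads closer to the definition of additive coding.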

Let's assume that the first variant is the true causal variant and that its effect $\beta$ is drawn from a normal distribution $N(0,1)$, while the error term is drawn from a normal distribution with mean 0 and standard deviation 0.3 (i.e., $N(0, 0.3^2)$). The observed trait values are then:

In [16]:
beta_1 <- rnorm(1, mean = 0, sd = 1)
epsilon <- rnorm(N, mean = 0, sd = 0.3)
Y <- X[, 1] * beta_1 + epsilon
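
Because the genotype column is standardized to unit variance, this simulation implies an expected PVE of $\beta^2 / (\beta^2 + \sigma^2)$ for the causal variant, where $\sigma$ is the error standard deviation. The sketch below checks this on a larger, separately simulated sample. The sample size, allele frequency, and effect size are assumptions chosen for illustration and are not the values used in the example.

```r
# Illustrative check of PVE = beta^2 / (beta^2 + sigma^2); values are assumed
set.seed(2)
n     <- 10000
beta  <- 0.5     # assumed effect size
sigma <- 0.3     # error standard deviation, as in the example

# Standardized additive genotype (0/1/2 allele counts, then centered and scaled)
x <- as.numeric(scale(rbinom(n, size = 2, prob = 0.3)))
y <- x * beta + rnorm(n, mean = 0, sd = sigma)

expected_pve <- beta^2 / (beta^2 + sigma^2)
observed_pve <- summary(lm(y ~ x))$r.squared
c(expected = expected_pve, observed = observed_pve)
```

With only 5 individuals, as in the example, the observed value can differ substantially from this expectation.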


Next, let's fit a separate OLS model for each variant:

In [17]:
# OLS summary statistics for all variants
p_values <- numeric(M)
betas <- numeric(M)
pve <- numeric(M)
for (j in 1:M) {
  SNP <- X[, j]
  model <- lm(Y ~ SNP)
  summary_model <- summary(model)
  
  betas[j] <- summary_model$coefficients[2, 1]
  p_values[j] <- summary_model$coefficients[2, 4]
  pve[j] <- summary_model$r.squared
}

# Create summary table
OLS_results <- data.frame(Variant = colnames(X), Beta = betas, P_Value = p_values, PVE = pve)
OLS_results

| Variant | Beta | P_Value | PVE |
|---|---|---|---|
| Variant 1 | -1.588887 | 0.005496276 | 0.9454644 |
| Variant 2 | 0.886833 | 0.344599346 | 0.2945391 |
| Variant 3 | 1.04868 | 0.243084489 | 0.4118557 |
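
In a simple regression with a single predictor, the R-squared that the loop stores as PVE equals the squared Pearson correlation between the trait and the variant. A minimal sketch with simulated data, where all values are illustrative assumptions:

```r
# Single-predictor regression: R-squared equals the squared correlation
set.seed(3)
n   <- 50
snp <- rbinom(n, size = 2, prob = 0.4)   # assumed additive genotype coding
y   <- 0.8 * snp + rnorm(n)              # assumed effect size and noise

r2_lm  <- summary(lm(y ~ snp))$r.squared
r2_cor <- cor(y, snp)^2

all.equal(r2_lm, r2_cor)
```

This is why any variant that is correlated with the trait, causal or not, receives a non-zero marginal PVE.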


Here we added a PVE column to the summary statistics (normally you would not see this column). We know that the first variant is the true causal variant, and it explains most of the variance. But why do the PVE values sum to more than 1, and why do the second and third variants also appear to explain some of the variance?

What if we fit a joint model?

In [18]:
# The previous code calculated marginal effects (GWAS-style, one SNP at a time)
# Now let's calculate joint effects by including all variants in one model

# Multiple regression model including all variants simultaneously
joint_model <- lm(Y ~ X)
joint_summary <- summary(joint_model)
joint_summary$r.squared # Overall model R-squared

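The joint model suggests that the three variants together explain 95.87% of the total variance. This is far less than the sum of the marginal PVE values (about 1.65): the variants are correlated with one another, so the marginal models count the shared variance more than once, while the joint model counts it only once. With only 5 individuals, all of these estimates are also very noisy.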
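
A small simulation can reproduce the pattern seen in this example, where marginal PVE values add up to more than the joint R-squared. The sketch assumes two correlated predictors, only the first of which affects the trait. The sample size, correlation strength, and noise level are illustrative assumptions.

```r
# Marginal vs. joint R-squared with correlated predictors (illustrative values)
set.seed(4)
n  <- 200
x1 <- rnorm(n)                     # stand-in for the causal variant
x2 <- x1 + rnorm(n, sd = 0.5)      # non-causal variant correlated with x1
y  <- x1 + rnorm(n, sd = 0.3)      # trait depends on x1 only

marginal_r2 <- c(
  x1 = summary(lm(y ~ x1))$r.squared,
  x2 = summary(lm(y ~ x2))$r.squared
)
joint_r2 <- summary(lm(y ~ x1 + x2))$r.squared

sum(marginal_r2)   # counts the variance shared by x1 and x2 twice
joint_r2           # counts it once, so it stays at or below 1
```

The joint R-squared is never smaller than the largest marginal R-squared, but it can be much smaller than their sum when the predictors overlap.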

# Supplementary

- Zhu H, Zhou X. Statistical methods for SNP heritability estimation and partition: A review. Comput Struct Biotechnol J. 2020 Jun 18;18:1557-1568. doi: 10.1016/j.csbj.2020.06.011. PMID: 32637052; PMCID: PMC7330487.