# Proportion of Variance Explained and Heritability

Proportion of variance explained (PVE) measures how much of the total variation in a trait (like height or disease risk) can be attributed to specific variables in your statistical model (e.g., genetic variants). Heritability is a specific application of this concept that measures how much of the variation in a trait across a population can be explained by genetic differences.

# Graphical Summary

![PVE](./cartoons/proportion_of_variance_explained.svg)

# Key Formula

Any phenotype can be modeled as the sum of genetic and environmental effects, i.e., $\text{Phenotype}~(Y) = \text{Genotype}~(G) + \text{Environment}~(E)$. Under the assumption that $G$ and $E$ are independent, the **proportion of variance explained (PVE)** by the genetic effect alone (also called broad-sense heritability, $H^2$) is

$$
\text{PVE} = H^2 = \frac{\text{Var}(G)}{\text{Var}(Y)} = \frac{\text{Var}(G)}{\text{Var}(G) + \text{Var}(E)}
$$

where:
- $\text{Var}(Y)$ is the total phenotypic variance (the phenotype is written $P$ in the Technical Details section)
- $\text{Var}(G)$ is the genetic variance component
- $\text{Var}(E)$ is the environmental variance component
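
The ratio can be illustrated with a small simulation. This is a minimal sketch, not part of the original example: the variances 6 and 4, the sample size, and the seed are hypothetical choices, and $G$ and $E$ are drawn independently so that the independence assumption holds by construction.

```r
# Minimal simulation sketch: independent genetic and environmental effects
set.seed(1)
n <- 10000

G <- rnorm(n, mean = 0, sd = sqrt(6))  # hypothetical genetic values, Var(G) = 6
E <- rnorm(n, mean = 0, sd = sqrt(4))  # hypothetical environmental deviations, Var(E) = 4
Y <- G + E                             # phenotype

H2_true <- 6 / (6 + 4)                 # population value implied by the chosen variances
H2_hat  <- var(G) / var(Y)             # sample estimate of PVE = H^2

cat("True H^2:", H2_true, " Estimated H^2:", round(H2_hat, 3), "\n")
```

In real data $G$ is not observed directly, which is why the example later in this notebook estimates variance components from a fitted model instead.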

# Technical Details

## Components of Variance

Any phenotype can be modeled as the sum of genetic and environmental effects:

$$\text{Phenotype}~(P) = \text{Genotype}~(G) + \text{Environment}~(E)$$

The phenotypic variance in the trait can then be partitioned as:

$$\text{Var}(P) = \text{Var}(G) + \text{Var}(E) + 2\text{Cov}(G,E)$$

Where:
- $\text{Var}(G)$ is the genetic variance component
- $\text{Var}(E)$ is the environmental variance component
- $\text{Cov}(G,E)$ is the covariance between genetic and environmental effects
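
The partition above is an algebraic identity, so it also holds exactly for sample variances and covariances. The sketch below checks it numerically; the coefficient 0.5 that ties the environment to the genotype, the sample size, and the seed are hypothetical values chosen only to make $\text{Cov}(G,E)$ non-zero.

```r
# Numerical check of Var(P) = Var(G) + Var(E) + 2 Cov(G, E)
set.seed(2)
n <- 1000

G <- rnorm(n)              # hypothetical genetic values
E <- 0.5 * G + rnorm(n)    # environment deliberately correlated with genotype
P <- G + E                 # phenotype

lhs <- var(P)
rhs <- var(G) + var(E) + 2 * cov(G, E)

# The two sides agree up to floating-point error
stopifnot(isTRUE(all.equal(lhs, rhs)))

# Share of Var(P) carried by the covariance term
cov_share <- 2 * cov(G, E) / var(P)
cat("Covariance term as a share of Var(P):", round(cov_share, 3), "\n")
```

When the covariance share is far from zero, $\text{Var}(G)/\text{Var}(P)$ no longer has a clean interpretation as the genetic proportion of variance.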

## Broad-sense Heritability

In controlled experimental settings, studies can be designed so that $\text{Cov}(G,E)$ is approximately zero. In such cases, broad-sense heritability is defined as the proportion of phenotypic variance attributable to all genetic effects:

$$H^2 = \frac{\text{Var}(G)}{\text{Var}(P)}$$

With $\text{Cov}(G,E) = 0$, $\text{Var}(P) = \text{Var}(G) + \text{Var}(E)$, so $H^2$ always lies between 0 and 1.

## Narrow-sense Heritability

**Narrow-sense heritability** ($h^2$) is the proportion of phenotypic variance attributable to **additive** genetic effects only:

$$h^2 = \frac{\text{Var}(A)}{\text{Var}(P)}$$

where $\text{Var}(A)$ is the additive genetic variance, a component of $\text{Var}(G)$.

The other components of $\text{Var}(G)$ are $\text{Var}(D)$ (dominance variance) and $\text{Var}(I)$ (epistatic variance, i.e., gene-gene interaction), so $h^2 \le H^2$.
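
To make the split between additive and dominance variance concrete, here is a minimal single-locus sketch. It assumes Hardy-Weinberg genotype frequencies and the classical parameterization in which the three genotypes have values $-a$, $d$, and $+a$; the values of `p`, `a`, and `d` and the helper `wvar` are hypothetical. With a single locus there is no epistasis, so $\text{Var}(I) = 0$ here.

```r
# Single-locus sketch: split Var(G) into additive and dominance parts
p <- 0.3; q <- 1 - p            # hypothetical alternate-allele frequency
a <- 1.0; d <- 0.5              # hypothetical additive and dominance effects

x    <- c(0, 1, 2)              # alternate-allele count per genotype
freq <- c(q^2, 2 * p * q, p^2)  # Hardy-Weinberg genotype frequencies
g    <- c(-a, d, a)             # genotypic values

# Population (frequency-weighted) variance; hypothetical helper
wvar <- function(v, w) sum(w * (v - sum(w * v))^2)

# Frequency-weighted regression of genotypic value on allele count:
# fitted values are the additive part, residuals are dominance deviations
fit      <- lm(g ~ x, weights = freq)
additive <- as.numeric(fitted(fit))

var_G <- wvar(g, freq)
var_A <- wvar(additive, freq)
var_D <- wvar(g - additive, freq)

cat("Var(G):", var_G, " Var(A):", var_A, " Var(D):", var_D, "\n")
```

The regression slope is the average effect of an allele substitution, which is why only the part of $\text{Var}(G)$ captured by the fitted line counts toward $h^2$.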


# Example

This example calculates the PVE for each genetic variant in a mixed-effects model framework. 

For each variant, it fits a linear mixed model with the variant as a fixed effect and family structure as a random effect, then partitions the total phenotypic variance into three components: 
- fixed effect variance (from the specific genetic variant)
- family variance (from shared genetic background)
- residual variance (unexplained variation)

The code extracts variance components from the model, calculates the fixed effect variance by multiplying the variant data by its effect size, and then determines what percentage each component contributes to the total variance. 
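
Written as a formula, the quantity computed for variant $j$ is the following. The notation is introduced here for illustration and does not appear in the code: $x_j$ is the scaled genotype column, $\hat\beta_j$ its estimated fixed effect, and $\hat\sigma^2_{\text{family}}$ and $\hat\sigma^2_{\text{residual}}$ are the estimated variance components.

$$
\text{PVE}_{\text{fixed}} = \frac{\text{Var}(x_j \hat\beta_j)}{\text{Var}(x_j \hat\beta_j) + \hat\sigma^2_{\text{family}} + \hat\sigma^2_{\text{residual}}}
$$

The family and residual PVEs use the same denominator with their own variance component in the numerator. Note that the denominator is the sum of the estimated components, not the sample variance of the phenotype, so it need not equal 1 even though the phenotype is scaled.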

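We also compute the PVE of all variants jointly in a single model. With only five individuals, the numbers are purely illustrative, and `lme4` reports singular-fit and convergence warnings for several of the models.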


In [12]:
# Clear the environment
rm(list = ls())

# Load required packages
library(lme4)  # For mixed-effect models

# Define genotypes for 5 individuals at 3 variants
# These represent actual alleles at each position
# For example, Individual 1 has genotypes: CC, CT, AT
genotypes <- c(
 "CC", "CT", "AT",  # Individual 1
 "TT", "TT", "AA",  # Individual 2
 "CT", "CT", "AA",  # Individual 3
 "CC", "TT", "AA",  # Individual 4
 "CC", "CC", "TT"   # Individual 5
)
# Reshape into a matrix
N <- 5 # number of individuals
M <- 3 # number of variants
geno_matrix <- matrix(genotypes, nrow=N, ncol=M, byrow=TRUE)
rownames(geno_matrix) <- paste("Individual", 1:N)
colnames(geno_matrix) <- paste("Variant", 1:M)

alt_alleles <- c("T", "C", "T")

# Convert to a raw genotype matrix under the additive model
# (dominant and recessive codings are not used in this example)
Xraw_additive <- matrix(0, nrow=N, ncol=M) # count of non-reference (alternate) alleles

rownames(Xraw_additive) <- rownames(geno_matrix)
colnames(Xraw_additive) <- colnames(geno_matrix)

for (i in 1:N) {
  for (j in 1:M) {
    alleles <- strsplit(geno_matrix[i,j], "")[[1]]
    Xraw_additive[i,j] <- sum(alleles == alt_alleles[j])
  }
}
X <- scale(Xraw_additive, center=TRUE, scale=TRUE)

# assign observed height for the 5 individuals
Y_raw <- c(180, 160, 158, 155, 193)
Y <- scale(Y_raw)

# Add family information (individuals 1,2,3 are from family 1, and 4,5 from family 2)
family_info <- c(1, 1, 1, 2, 2)

# Create a data frame for analysis with scaled genotypes
genetic_data <- data.frame(
  height = Y,
  family = family_info,
  individual = 1:N,
  X  # This directly adds all columns of X to the dataframe
)

# Rename the genotype columns for clarity
colnames(genetic_data)[4:(3+M)] <- paste0("variant", 1:M)

genetic_data

| | height | family | individual | variant1 | variant2 | variant3 |
|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Individual 1 | 0.6528093 | 1 | 1 | -0.6708204 | 0.2390457 | 0.4472136 |
| Individual 2 | -0.5560968 | 1 | 2 | 1.5652476 | -0.9561829 | -0.6708204 |
| Individual 3 | -0.6769875 | 1 | 3 | 0.4472136 | 0.2390457 | -0.6708204 |
| Individual 4 | -0.8583234 | 2 | 4 | -0.6708204 | -0.9561829 | -0.6708204 |
| Individual 5 | 1.4385984 | 2 | 5 | -0.6708204 | 1.4342743 | 1.5652476 |


In [13]:
# Single variant analysis and PVE results
pve_results_single_variants <- data.frame(
  Variant = character(),
  PVE_Fixed = numeric(),
  PVE_Family = numeric(),
  PVE_Residual = numeric(),
  Fixed_Variance = numeric(),
  Family_Variance = numeric(),
  Residual_Variance = numeric(),
  Total_Variance = numeric(),
  stringsAsFactors = FALSE
)

# Calculate PVE for each variant individually
for (j in 1:M) {
  variant_col <- paste0("variant", j)
  formula <- as.formula(paste("height ~", variant_col, "+ (1|family)"))
  
  # Fit a linear mixed model: variant as fixed effect, family as random intercept
  model <- lmer(formula, data = genetic_data)
  
  # Extract variance components
  vc <- VarCorr(model)
  family_variance <- as.numeric(vc$family)
  residual_variance <- attr(vc, "sc")^2
  
  # Calculate fixed effect variance
  beta <- fixef(model)[variant_col]
  X_var <- matrix(genetic_data[[variant_col]])
  fixed_variance <- var(as.vector(X_var * beta))
  
  # Calculate total variance
  total_variance <- fixed_variance + family_variance + residual_variance
  
  # Calculate PVE for each component
  pve_fixed <- fixed_variance / total_variance
  pve_family <- family_variance / total_variance
  pve_residual <- residual_variance / total_variance
  
  # Store results in the data frame
  pve_results_single_variants <- rbind(pve_results_single_variants, data.frame(
    Variant = variant_col,
    PVE_Fixed = pve_fixed * 100,
    PVE_Family = pve_family * 100,
    PVE_Residual = pve_residual * 100,
    Fixed_Variance = fixed_variance,
    Family_Variance = family_variance,
    Residual_Variance = residual_variance,
    Total_Variance = total_variance,
    stringsAsFactors = FALSE
  ))
}

pve_results_single_variants

boundary (singular) fit: see help('isSingular')

boundary (singular) fit: see help('isSingular')



| Variant | PVE_Fixed | PVE_Family | PVE_Residual | Fixed_Variance | Family_Variance | Residual_Variance | Total_Variance |
|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| variant1 | 20.00779 | 0.0 | 79.9922054 | 0.2500913 | 0.0 | 0.999878212 | 1.24997 |
| variant2 | 66.60926 | 0.0 | 33.3907386 | 0.7267603 | 0.0 | 0.364319641 | 1.09108 |
| variant3 | 95.70867 | 3.655575 | 0.6357522 | 1.1000658 | 0.04201681 | 0.007307271 | 1.14939 |
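
For comparison, the sketch below drops the family random effect, which is a simplifying assumption and not the model fitted above. In an ordinary linear regression with one predictor, the PVE is the $R^2$ of the fit, which equals the squared correlation between allele count and phenotype. The allele counts are re-entered by hand from the genotypes defined earlier, and `pve_lm` is a hypothetical helper name.

```r
# Alternate-allele counts from the example (5 individuals x 3 variants)
X <- cbind(
  variant1 = c(0, 2, 1, 0, 0),
  variant2 = c(1, 0, 1, 0, 2),
  variant3 = c(1, 0, 0, 0, 2)
)
Y <- c(180, 160, 158, 155, 193)  # observed heights

# Hypothetical helper: PVE of one predictor in an ordinary linear model,
# computed as variance of fitted values over variance of the phenotype
pve_lm <- function(x, y) {
  fit <- lm(y ~ x)
  var(fitted(fit)) / var(y)
}

pve_single <- apply(X, 2, pve_lm, y = Y)
r_squared  <- apply(X, 2, function(x) cor(x, Y)^2)

# The two definitions coincide for single-predictor regression
stopifnot(isTRUE(all.equal(pve_single, r_squared)))
print(round(pve_single, 4))
```

For variant1 this gives about 0.25, which agrees with the `Fixed_Variance` column because the phenotype was scaled to unit variance. `PVE_Fixed` is lower because its denominator is the sum of the estimated variance components.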


In [14]:
# Calculate PVE for all variants together
# Fit the full model with all variants
full_formula <- as.formula("height ~ variant1 + variant2 + variant3 + (1|family)")
full_model <- lmer(full_formula, data = genetic_data)

# Extract variance components
vc_full <- VarCorr(full_model)
family_variance_full <- as.numeric(vc_full$family)
residual_variance_full <- attr(vc_full, "sc")^2

# Calculate fixed effect variance for the full model
betas <- fixef(full_model)[2:(M+1)]  # Skip intercept
X_design <- as.matrix(genetic_data[, 4:(3+M)])
predicted_fixed <- X_design %*% betas
fixed_variance_full <- var(as.vector(predicted_fixed))

# Calculate total variance
total_variance_full <- fixed_variance_full + family_variance_full + residual_variance_full

# Calculate PVE for each component
pve_fixed_full <- fixed_variance_full / total_variance_full
pve_family_full <- family_variance_full / total_variance_full
pve_residual_full <- residual_variance_full / total_variance_full

# Print out the results for all variants together
cat("\n---------- PVE Analysis for All Variants Together ----------\n")
cat("Fixed effects (all variants) PVE:", round(pve_fixed_full * 100, 2), "%\n")
cat("Random effect (family) PVE:", round(pve_family_full * 100, 2), "%\n")
cat("Residual PVE:", round(pve_residual_full * 100, 2), "%\n")
cat("Total:", round((pve_fixed_full + pve_family_full + pve_residual_full) * 100, 2), "%\n\n")

cat("Variance components:\n")
cat("Fixed effects variance:", round(fixed_variance_full, 4), "\n")
cat("Family variance:", round(family_variance_full, 4), "\n")
cat("Residual variance:", round(residual_variance_full, 4), "\n")
cat("Total variance:", round(total_variance_full, 4), "\n")

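# Family-based proxy for heritability: the intraclass correlation of the family effect,
# i.e. family variance over (family + residual) variance, excluding fixed-effect variance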
heritability <- family_variance_full / (family_variance_full + residual_variance_full)
cat("\nHeritability (proportion of variance due to family structure):", 
    round(heritability * 100, 2), "%\n")

"unable to evaluate scaled gradient"
" Hessian is numerically singular: parameters are not uniquely determined"



---------- PVE Analysis for All Variants Together ----------
Fixed effects (all variants) PVE: 92.48 %
Random effect (family) PVE: 1.41 %
Residual PVE: 6.11 %
Total: 100 %

Variance components:
Fixed effects variance: 1.0142 
Family variance: 0.0155 
Residual variance: 0.067 
Total variance: 1.0967 

Heritability (proportion of variance due to family structure): 18.77 %
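
As an arithmetic check, the reported percentages can be recovered from the printed variance components. The inputs are the rounded values shown above, so the last digit can differ from the printed percentages:

$$
\text{PVE}_{\text{fixed}} \approx \frac{1.0142}{1.0967} \approx 0.925, \qquad
\frac{\hat\sigma^2_{\text{family}}}{\hat\sigma^2_{\text{family}} + \hat\sigma^2_{\text{residual}}} \approx \frac{0.0155}{0.0155 + 0.0670} \approx 0.188
$$

The second ratio leaves the fixed-effect variance out of the denominator, which is why it is much larger than the 1.41 % family PVE.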


# Supplementary

- Zhu H, Zhou X. Statistical methods for SNP heritability estimation and partition: A review. Comput Struct Biotechnol J. 2020 Jun 18;18:1557-1568. doi: 10.1016/j.csbj.2020.06.011. PMID: 32637052; PMCID: PMC7330487.