# Linear Mixed Model

The linear mixed model in statistical genetics **accounts for correlation between individuals** due to shared genetic background (such as family relationships) by partitioning the phenotype into **fixed effects** (the specific genetic variants being tested) and **random effects** (the overall genetic similarity between individuals).

# Graphical Summary

![Fig](./graphical_summary/slides/Slide13.png)

# Key Formula

The linear mixed model (**mixed** because it incorporates both **fixed** and **random** effects) represents non-independent data structures as

$$
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\epsilon}
$$

where:
- $\mathbf{Y}$ is the $N \times 1$ vector of phenotypes
- $\mathbf{X}$ is the $N \times M$ design matrix for fixed effects (e.g., genotypes, covariates)
- $\boldsymbol{\beta}$ is the $M \times 1$ vector of fixed effect coefficients (unknown, to be estimated)
- $\mathbf{g}$ is the $N \times 1$ vector of random effects
- $\boldsymbol{\epsilon}$ is the $N \times 1$ vector of residual errors, where $\boldsymbol{\epsilon} \sim N(0, \sigma^2_e\mathbf{I})$

# Technical Details

## Random Effects Decomposition: $\mathbf{g}$ and $\mathbf{Zu}$

The random effects term **g** in the linear mixed model can be decomposed as:

$$
\mathbf{g} = \mathbf{Z} \mathbf{u}
$$

where:
- $\mathbf{Z}$ is the $N \times N$ design matrix for random effects (the identity matrix when each individual contributes one observation)
- $\mathbf{u}$ is the $N \times 1$ vector of random effects, where $\mathbf{u} \sim N(0, \sigma^2_u\mathbf{G})$
- $\mathbf{G}$ is the $N \times N$ relationship matrix between the $N$ individuals

This decomposition allows us to model the covariance structure of the random effects through the relationship matrix **G**, which captures dependencies between individuals (e.g., genetic relatedness, population structure).
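To make the decomposition concrete, the following sketch draws $\mathbf{u}$ and forms $\mathbf{g} = \mathbf{Z}\mathbf{u}$ for four hypothetical individuals. The relationship matrix (two sibling pairs with relatedness 0.5), the value of `sigma_u`, and the choice $\mathbf{Z} = \mathbf{I}$ (one observation per individual) are illustrative assumptions, not values taken from this document.

```r
library(MASS)  # for mvrnorm
set.seed(1)

N <- 4

# Hypothetical relationship matrix: individuals 1-2 and 3-4 are sibling pairs
G <- diag(N)
G[1, 2] <- G[2, 1] <- 0.5
G[3, 4] <- G[4, 3] <- 0.5

# Assumed design matrix: one observation per individual, so Z is the identity
Z <- diag(N)

# Assumed standard deviation of the random effects
sigma_u <- 0.5

# Draw u ~ N(0, sigma_u^2 * G) and map it to the observations
u <- mvrnorm(n = 1, mu = rep(0, N), Sigma = sigma_u^2 * G)
g <- as.vector(Z %*% u)
g
```

With $\mathbf{Z} = \mathbf{I}$ the vectors `g` and `u` coincide; a non-identity $\mathbf{Z}$ would only be needed if some individuals had repeated or missing records.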

## Variance-Covariance Structure

Assuming $\mathbf{u}$ and $\boldsymbol{\epsilon}$ are independent, the variance-covariance matrix of the phenotype **Y** is:

$$
\text{Var}(\mathbf{Y}) = \mathbf{Z}\mathbf{G}\mathbf{Z}^T\sigma^2_u + \mathbf{I}\sigma^2_e
$$

This structure accounts for:
- **Genetic relatedness** through $\mathbf{Z}\mathbf{G}\mathbf{Z}^T\sigma^2_u$
- **Independent residual variation** through $\mathbf{I}\sigma^2_e$
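
The structure above can be checked numerically. The sketch below builds $\text{Var}(\mathbf{Y})$ from its two components and compares it with the empirical covariance of many simulated phenotype vectors. The relationship matrix, `sigma_u`, `sigma_e`, and the number of replicates are assumptions chosen for illustration; fixed effects are left out because they shift the mean without changing the covariance.

```r
library(MASS)  # for mvrnorm
set.seed(2)

N <- 4

# Hypothetical relationship matrix: two sibling pairs
G <- diag(N)
G[1, 2] <- G[2, 1] <- 0.5
G[3, 4] <- G[4, 3] <- 0.5

Z <- diag(N)      # assumed: one observation per individual
sigma_u <- 0.5    # assumed SD of random effects
sigma_e <- 0.3    # assumed SD of residuals

# Theoretical covariance: Z G Z^T sigma_u^2 + I sigma_e^2
V <- sigma_u^2 * Z %*% G %*% t(Z) + sigma_e^2 * diag(N)

# Monte Carlo check: each row of Y_sim is one simulated (g + epsilon) vector
n_rep <- 20000
U <- mvrnorm(n_rep, mu = rep(0, N), Sigma = sigma_u^2 * G)
E <- matrix(rnorm(n_rep * N, mean = 0, sd = sigma_e), nrow = n_rep, ncol = N)
Y_sim <- U %*% t(Z) + E

V_hat <- cov(Y_sim)
max_abs_diff <- max(abs(V_hat - V))

V
max_abs_diff
```

Related individuals share a non-zero off-diagonal entry in `V`, while unrelated pairs have covariance zero; the residual term only adds to the diagonal.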


## Popular LMM Methods in Statistical Genetics

| Method | Purpose | Key Innovation | Scale/Application |
|--------|---------|----------------|-------------------|
| **GCTA** | Heritability estimation and association testing | Uses genome-wide SNPs to construct genetic relationship matrix (GRM) | $h^2 = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2_e}$ estimation |
| **GEMMA** | Fast genome-wide association studies with population structure control | Efficient algorithms for large-scale GWAS with kinship correction | Computationally efficient for biobank-scale data |
| **BOLT-LMM** | Ultra-fast mixed model association testing | Bayesian mixture-of-normals prior on SNP effects; avoids explicitly computing the GRM | Hundreds of thousands of individuals |
| **SAIGE** | Association testing for binary and quantitative traits | Saddlepoint approximation to keep tests calibrated under unbalanced case-control ratios | Biobank-scale cohorts with related individuals |
| **REGENIE** | Whole genome regression with prediction | Two-step procedure combining whole-genome regression with mixed models | Robust to population stratification and relatedness |

# Example

This example shows how data are generated under a linear mixed model, which extends simple linear regression by combining fixed effects (specific SNP associations) with random effects (polygenic background). The model accounts for genetic relationships between individuals through a Genetic Relationship Matrix (GRM), which captures the correlation arising from shared ancestry and population structure. We simulate phenotypes that include both a direct effect from a specific variant and a polygenic contribution from the genome-wide background.

Related topics:
- [OLS](https://gaow.github.io/statgen-prerequisites/ordinary_least_squares.html)
- [GRM](https://gaow.github.io/statgen-prerequisites/genetic_relationship_matrix.html)
- [random effect](https://gaow.github.io/statgen-prerequisites/random_effect.html)
- [PVE](https://gaow.github.io/statgen-prerequisites/proportion_of_variance_explained.html)

In [13]:
# Clear the environment
rm(list = ls())
set.seed(13)
library(MASS) # For mvrnorm function
# Define genotypes for 5 individuals at 3 variants
# These represent actual alleles at each position
# For example, Individual 1 has genotypes: CC, CT, AT
genotypes <- c(
 "CC", "CT", "AT",  # Individual 1
 "TT", "TT", "AA",  # Individual 2
 "CT", "CT", "AA",  # Individual 3
 "CC", "TT", "AA",  # Individual 4
 "CC", "CC", "TT"   # Individual 5
)

# Reshape into a matrix
N <- 5  # number of individuals
M <- 3  # number of variants
geno_matrix <- matrix(genotypes, nrow = N, ncol = M, byrow = TRUE)
rownames(geno_matrix) <- paste("Individual", 1:N)
colnames(geno_matrix) <- paste("Variant", 1:M)

alt_alleles <- c("T", "C", "T")

# Convert to raw genotype matrix using the additive model
Xraw_additive <- matrix(0, nrow = N, ncol = M) # count number of non-reference alleles

rownames(Xraw_additive) <- rownames(geno_matrix)
colnames(Xraw_additive) <- colnames(geno_matrix)

for (i in 1:N) {
  for (j in 1:M) {
    alleles <- strsplit(geno_matrix[i,j], "")[[1]]
    Xraw_additive[i,j] <- sum(alleles == alt_alleles[j])
  }
}

X <- scale(Xraw_additive, center = TRUE, scale = TRUE)

We generate the data according to a linear mixed model:

In [14]:
# Linear Mixed Model: Y = X*beta + g + epsilon
# where g = Z*u represents random polygenic effects

# Fixed effect for variant 1
beta <- 0.8  # Fixed effect size for variant 1

# Create Genetic Relationship Matrix (GRM) using all variants
# GRM = (1/M) * X %*% t(X), where X is the standardized genotype matrix
GRM <- (1/M) * X %*% t(X)
cat("Genetic Relationship Matrix (GRM):\n")
GRM

# Generate random polygenic effects: g ~ N(0, sigma_u^2 * GRM)
sigma_u <- 0.5  # Standard deviation of random effects
g <- mvrnorm(n = 1, mu = rep(0, N), Sigma = sigma_u^2 * GRM)
g <- as.vector(g)

# Residual errors
sigma_e <- 0.3  # Standard deviation of residual errors
epsilon <- rnorm(N, mean = 0, sd = sigma_e)

# Generate phenotype with both fixed and random effects
Y <- X[, 1] * beta + g + epsilon


Genetic Relationship Matrix (GRM):

| | Individual 1 | Individual 2 | Individual 3 | Individual 4 | Individual 5 |
|---|---|---|---|---|---|
| Individual 1 | 0.23571429 | -0.5261905 | -0.18095238 | -0.02619048 | 0.497619 |
| Individual 2 | -0.52619048 | 1.2714286 | 0.30714286 | 0.1047619 | -1.1571429 |
| Individual 3 | -0.18095238 | 0.3071429 | 0.23571429 | -0.02619048 | -0.3357143 |
| Individual 4 | -0.02619048 | 0.1047619 | -0.02619048 | 0.6047619 | -0.6571429 |
| Individual 5 | 0.49761905 | -1.1571429 | -0.33571429 | -0.65714286 | 1.652381 |
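
A few properties of this GRM are worth checking, since they explain the negative entries and why the matrix is singular. The sketch below rebuilds the GRM from the additive allele counts implied by the genotypes and `alt_alleles` above (the name `Xraw` and the summary variables are introduced here for illustration). Because every column of the standardized matrix is centered, each row of the GRM sums to zero, and with `scale()` dividing by the sample standard deviation the average diagonal is $(N-1)/N$ rather than 1.

```r
# Additive allele counts implied by the genotypes and alt alleles in the example
Xraw <- matrix(c(0, 1, 1,   # Individual 1: CC, CT, AT
                 2, 0, 0,   # Individual 2: TT, TT, AA
                 1, 1, 0,   # Individual 3: CT, CT, AA
                 0, 0, 0,   # Individual 4: CC, TT, AA
                 0, 2, 2),  # Individual 5: CC, CC, TT
               nrow = 5, ncol = 3, byrow = TRUE)
N <- nrow(Xraw)
M <- ncol(Xraw)

# Standardize each variant and form the GRM as in the example
X <- scale(Xraw, center = TRUE, scale = TRUE)
GRM <- (1 / M) * X %*% t(X)

is_symmetric <- isSymmetric(GRM)   # a GRM is symmetric by construction
row_sums <- rowSums(GRM)           # numerically zero because columns are centered
mean_diag <- mean(diag(GRM))       # equals (N - 1) / N here
grm_rank <- qr(GRM)$rank           # at most M, so the 5 x 5 matrix is singular

round(GRM, 4)
```

A singular, positive semi-definite GRM is still a valid covariance matrix for simulation, which is why `mvrnorm` can draw `g` from it.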


We can now compute the PVE attributable to each component:

In [15]:
# In an LMM the genetic signal has a fixed part and a random part.
# Their sample variances add up to var(G_total) only when the sample
# covariance between the two parts is zero.
# Fixed genetic component from variant 1
G_fixed <- X[, 1] * beta

# Random genetic component (polygenic background)
G_random <- g

# Total genetic component
G_total <- G_fixed + G_random

# Calculate variance components
var_G_fixed <- var(G_fixed)
var_G_random <- var(G_random)
var_G_total <- var(G_total)
var_Y <- var(Y)

# PVE calculations
PVE_fixed <- var_G_fixed / var_Y           # PVE from fixed effects only
PVE_random <- var_G_random / var_Y         # PVE from random effects only  
PVE_total <- var_G_total / var_Y           # Total PVE

cat("PVE Components:\n")
cat("PVE_fixed (variant 1) =", round(PVE_fixed, 4), "\n")
cat("PVE_random (polygenic) =", round(PVE_random, 4), "\n")
cat("PVE_total =", round(PVE_total, 4), "\n\n")

PVE Components:
PVE_fixed (variant 1) = 0.8187 
PVE_random (polygenic) = 0.1212 
PVE_total = 0.67 
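
The reported `PVE_total` (0.67) is smaller than `PVE_fixed + PVE_random` (0.9399), and even smaller than `PVE_fixed` alone. This is not a bug: the sample variance of a sum includes twice the sample covariance of its parts, and with only five individuals that covariance is far from zero (here it must be negative). The sketch below checks the identity on two hypothetical vectors; `a` and `b` are made-up numbers, not the simulated `G_fixed` and `G_random`.

```r
# Hypothetical components for five individuals (illustrative values only)
a <- c(-0.5, 1.2, 0.4, -0.6, -0.5)  # stands in for a fixed genetic component
b <- c(0.3, -0.4, -0.1, 0.1, 0.1)   # stands in for a random genetic component

# Variance of the sum versus the sum of variances plus twice the covariance
var_sum <- var(a + b)
var_parts <- var(a) + var(b) + 2 * cov(a, b)

c(var_sum = var_sum, var_parts = var_parts, cov_ab = cov(a, b))
```

With a larger sample the covariance between independent components shrinks toward zero, and the component PVEs become approximately additive.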



Note that the same framework applies when the genetic effect $\beta$ is modeled as a random effect rather than fixed.
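
As a sketch of what this looks like, suppose (as an added assumption) that $\boldsymbol{\beta} \sim N(0, \sigma^2_\beta \mathbf{I})$ independently of $\mathbf{u}$ and $\boldsymbol{\epsilon}$, where $\sigma^2_\beta$ is a symbol introduced here for illustration. The term $\mathbf{X}\boldsymbol{\beta}$ then no longer shifts the mean of $\mathbf{Y}$ but adds to its covariance:

$$
\text{Var}(\mathbf{Y}) = \mathbf{X}\mathbf{X}^T\sigma^2_\beta + \mathbf{Z}\mathbf{G}\mathbf{Z}^T\sigma^2_u + \mathbf{I}\sigma^2_e
$$

Under this assumption the variants in $\mathbf{X}$ contribute to the covariance through $\mathbf{X}\mathbf{X}^T$, in the same way that the genome-wide variants contribute through the GRM.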