# Intuition



linear mixed model in statistical genetics

# Notations


A **Linear Mixed Model (LMM)** is a powerful tool used in statistical genetics to assess associations between genetic variants and traits while accounting for the relatedness between individuals. LMMs combine fixed effects, such as the effect of a genetic variant on a trait, with random effects that model the correlation structure of the data, typically due to family relationships or population structure.

### **Model Definition**

In the context of genetic data, the linear mixed model can be written as:

$$
y_i = X_i \beta + Z_i u_i + \epsilon_i
$$

Where:

- $y_i$ is the observed trait for individual $i$.
- $X_i$ is the genotype for individual $i$ (which could be a vector of genotype values for all variants).
- $\beta$ is the fixed effect vector that represents the marginal effects of the genetic variants (the effect size).
- $Z_i$ is the design matrix for the random effects, linking the observations of individual $i$ to the random effects.
- $u_i$ is the random effect vector, accounting for familial relatedness or population structure; in genetic applications its covariance is often specified by a kinship matrix.
- $\epsilon_i$ is the residual error term for individual $i$, which is assumed to be normally distributed with mean 0 and variance $\sigma^2$.

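> TianGe's slide P63

In this slide's notation the roles of the symbols are swapped relative to the definition above: $\boldsymbol{\beta}$ holds the random genotype effects and $\mathbf{u}$ holds the fixed covariate effects.
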
$$
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\epsilon}
$$
- $\mathbf{y}$: $N \times 1$ vector of phenotypes
- $\mathbf{X}$: $N \times J$ genotypes matrix
- $\boldsymbol{\beta} \sim N(0,\sigma_\beta^2 \mathbf{I})$: $J \times 1$ vector of random effect sizes
- $\mathbf{Z}$: covariate matrix (the design matrix for the fixed effects)
- $\mathbf{u}$: vector of the **fixed** effects of covariates
- $\boldsymbol{\epsilon} \sim  N(0,\sigma_\epsilon^2 \mathbf{I})$: $N \times 1$ residuals due to independent environment
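
Because $\boldsymbol{\beta}$ is random in this formulation, the genotypes enter the covariance of $\mathbf{y}$ rather than its mean. The sketch below illustrates this under stated assumptions: the toy genotype matrix and the values `sigma2_beta` and `sigma2_eps` are invented for illustration, and the covariate term $\mathbf{Z}\mathbf{u}$ is left out because it only shifts the mean.

```r
# Toy genotype matrix (hypothetical values): N = 4 individuals, J = 3 variants
X_raw <- matrix(c(0, 1, 2, 1,
                  2, 0, 1, 1,
                  0, 0, 1, 2), nrow = 4, ncol = 3)
N <- nrow(X_raw)
J <- ncol(X_raw)

# Standardize each variant (column) to mean 0 and unit variance
X <- scale(X_raw)

# Assumed variance components, chosen only for illustration
sigma2_beta <- 0.2
sigma2_eps  <- 0.5

# With beta ~ N(0, sigma2_beta * I), Var(X beta) = sigma2_beta * X X^T,
# so the covariance of y is:
V <- sigma2_beta * tcrossprod(X) + sigma2_eps * diag(N)

# The same covariance written in terms of a genetic relationship matrix K
K <- tcrossprod(X) / J
V_from_K <- (J * sigma2_beta) * K + sigma2_eps * diag(N)
```

Here `V` and `V_from_K` are the same matrix, which is why this model is often described as having a random genetic effect whose covariance is proportional to a relationship matrix computed from the genotypes.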

### **Assumptions**

The assumptions underlying the linear mixed model are as follows:
1. Conditional on the fixed and random effects, the trait values $y_i$ are assumed to be normally distributed.
2. The genotype matrix $X$ is standardized, i.e., centered and scaled (optional but commonly done).
3. The random effects $u_i$ are assumed to be normally distributed with mean 0 and covariance matrix $\Sigma_u$.
4. The residuals $\epsilon_i$ are assumed to be independent and normally distributed with mean 0 and variance $\sigma^2$.
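
Stacking the individual-level equations and combining assumptions 3 and 4 gives the marginal distribution of the trait vector. This is a sketch that additionally assumes the random effects and the residuals are independent of each other, and writes $X$ and $Z$ for the stacked design matrices:

$$
\mathbf{y} \sim N\left(X\beta,\; Z \Sigma_u Z^\top + \sigma^2 I\right)
$$

The term $Z \Sigma_u Z^\top$ is what allows related individuals, or repeated measurements on the same individual, to have correlated trait values; setting $\Sigma_u = 0$ recovers the ordinary linear model.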


### **Conclusion**

The linear mixed model is an essential tool in statistical genetics, allowing for the estimation of genetic associations while accounting for population structure and relatedness between individuals. When samples are related or stratified, it yields better-calibrated p-values and less confounded effect size estimates than an ordinary linear model that treats individuals as independent.

# Example

```r
rm(list = ls())
set.seed(42)  # For reproducibility

# Number of individuals and variants
N <- 5  # Number of individuals
M <- 3  # Number of SNPs (variants)

# Create a random genotype matrix (0, 1, 2 values for each SNP)
X_raw <- matrix(sample(0:2, N * M, replace = TRUE), nrow = N, ncol = M)

# Center and scale each SNP (column) across individuals
X <- scale(X_raw, scale = TRUE)

# Number of measurements per individual (to simulate repeated measurements)
n_measurements <- 5
individuals <- factor(rep(1:N, each = n_measurements))  # Grouping factor for repeated measures

# Simulate one random intercept per individual
random_intercepts <- rnorm(N, mean = 0, sd = 1)

# Simulate the outcome variable, with repeated measures for the same individual.
# Note: no SNP effect is simulated; height is random intercept plus noise only.
y <- rep(NA, length(individuals))

for (i in 1:N) {
  # For each individual, simulate 'n_measurements' height values around the random intercept
  y[individuals == i] <- random_intercepts[i] + rnorm(n_measurements, mean = 0, sd = 1)  # Measurement error
}

# Create a data frame with individuals, genotypes, and height.
# data.frame() splits the matrix X into one column per SNP (X1, X2, X3).
# Caution: X has only N rows, so data.frame() recycles them down the
# N * n_measurements rows; genotype values therefore cycle within each
# individual instead of staying constant.
data <- data.frame(individual = individuals, X)

# Add the height values to the data frame
data$height <- y
```
```r
head(data)
```

| | individual | X1 | X2 | X3 | height |
|---|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | -0.4472136 | -0.2390457 | -1.0954451 | 1.94082 |
| 2 | 1 | -0.4472136 | -0.2390457 | -1.0954451 | 1.020617 |
| 3 | 1 | -0.4472136 | -1.4342743 | 0.7302967 | -1.351586 |
| 4 | 1 | -0.4472136 | 0.9561829 | 0.7302967 | -1.135597 |
| 5 | 1 | 1.7888544 | 0.9561829 | 0.7302967 | 2.624983 |
| 6 | 2 | -0.4472136 | -0.2390457 | -1.0954451 | 1.980007 |


```r
# lmer() is provided by the lme4 package
library(lme4)

# Fit a Linear Mixed Model (LMM) with a random intercept for each individual
model <- lmer(height ~ X1 + X2 + X3 + (1 | individual), data = data)

# Summary of the model
summary(model)
```


```
Linear mixed model fit by REML ['lmerMod']
Formula: height ~ X1 + X2 + X3 + (1 | individual)
   Data: data

REML criterion at convergence: 86.3

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.51387 -0.77446  0.08211  0.62760  1.46843 

Random effects:
 Groups     Name        Variance Std.Dev.
 individual (Intercept) 2.679    1.637   
 Residual               1.282    1.132   
Number of obs: 25, groups:  individual, 5

Fixed effects:
             Estimate Std. Error t value
(Intercept)  0.004311   0.766254   0.006
X1           0.617622   0.320298   1.928
X2           0.051748   0.299611   0.173
X3          -0.242696   0.277386  -0.875

Correlation of Fixed Effects:
   (Intr) X1     X2    
X1  0.000              
X2  0.000 -0.500       
X3  0.000 -0.354  0.000
```
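
One way to read the random-effects table is through the share of trait variance attributed to differences between individuals (the intraclass correlation). The sketch below recomputes it by hand from the variance components printed in the summary; the variable names are illustrative, and with a fitted model the same components could instead be pulled from the model object with `VarCorr(model)`.

```r
# Variance components copied from the "Random effects" table above
var_individual <- 2.679  # between-individual (random intercept) variance
var_residual   <- 1.282  # within-individual (residual) variance

# Share of total variance explained by between-individual differences
icc <- var_individual / (var_individual + var_residual)
round(icc, 3)  # 0.676

# The reported t value is the estimate divided by its standard error (X1 row)
t_X1 <- 0.617622 / 0.320298
round(t_X1, 3)  # 1.928
```

The simulation drew both the random intercepts and the measurement error with a standard deviation of 1, so the true share is 0.5; with only five individuals the estimated variance components are noisy.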