# Intuition



linear mixed model in statistical genetics

# Notations


A **Linear Mixed Model (LMM)** is a powerful tool used in statistical genetics to assess associations between genetic variants and traits while accounting for the relatedness between individuals. LMMs combine fixed effects, such as the effect of a genetic variant on a trait, with random effects that model the correlation structure of the data, typically due to family relationships or population structure.

### **Model Definition**

In the context of genetic data, the linear mixed model can be written as:

$$
y_i = X_i \beta + Z_i u_i + \epsilon_i
$$

Where:

- $y_i$ is the observed trait for individual $i$.
- $X_i$ is the row vector of genotype values for individual $i$ (a single value for one variant, or one entry per variant when several are modeled).
- $\beta$ is the fixed effect vector of variant effect sizes, i.e., the expected change in the trait per unit change in genotype.
- $Z_i$ is the design matrix for the random effects, linking individual $i$ to the random effects. Genetic relationships between individuals enter through the covariance of the random effects, which is often specified by a kinship matrix.
- $u_i$ is the random effect vector, accounting for familial relatedness or population structure.
- $\epsilon_i$ is the residual error term for individual $i$, which is assumed to be normally distributed with mean 0 and variance $\sigma^2$.
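
Stacking all $n$ individuals gives a matrix form that makes the implied covariance of the trait explicit. The sketch below assumes that the random effects and the residuals are independent of each other. The genetics-specific choice $Z = I$ and $\Sigma_u = \sigma_g^2 K$, with kinship matrix $K$ and genetic variance $\sigma_g^2$, is a common convention and is introduced here as an assumption, not something fixed by the definitions above.

$$
y = X\beta + Zu + \epsilon, \qquad u \sim N(0, \Sigma_u), \qquad \epsilon \sim N(0, \sigma^2 I)
$$

$$
\operatorname{Var}(y) = Z \Sigma_u Z^\top + \sigma^2 I
\quad\Longrightarrow\quad
\operatorname{Var}(y) = \sigma_g^2 K + \sigma^2 I \quad \text{when } Z = I,\ \Sigma_u = \sigma_g^2 K
$$

Under this reading, two individuals have correlated trait values exactly to the extent that the corresponding off-diagonal entry of $Z \Sigma_u Z^\top$ is nonzero.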

### **Assumptions**

The assumptions underlying the linear mixed model are as follows:
1. The trait values $y_i$ are assumed to be normally distributed, conditional on the fixed and random effects.
2. The genotype matrix $X$ is standardized, i.e., centered and scaled (optional but commonly done).
3. The random effects $u_i$ are assumed to be normally distributed with mean 0 and covariance matrix $\Sigma_u$.
4. The residuals $\epsilon_i$ are assumed to be independent and normally distributed with mean 0 and variance $\sigma^2$.
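
As a rough illustration of assumptions 2 to 4, the base R sketch below simulates a trait with correlated random effects and then estimates the fixed effect by generalized least squares. Everything specific in it is an assumption made for the example: the sibling-pair kinship matrix `K`, the variance values `sigma2_g` and `sigma2_e`, the effect size `beta`, and the simplification that the variance components are known rather than estimated (a real LMM fit estimates them, for example by REML).

```r
# Toy illustration: all numeric values below are made up for this sketch
set.seed(42)
n <- 200

# Hypothetical kinship matrix: 100 independent sibling pairs (within-pair entry 0.5)
K <- kronecker(diag(n / 2), matrix(c(1, 0.5, 0.5, 1), nrow = 2))

sigma2_g <- 0.6  # assumed variance of the random effects (Sigma_u = sigma2_g * K)
sigma2_e <- 0.4  # assumed residual variance (sigma^2 in the text)
beta <- 0.3      # assumed true effect of the standardized genotype

# Assumption 2: genotype is centered and scaled
g <- rbinom(n, size = 2, prob = 0.3)
x <- as.numeric(scale(g))

# Assumption 3: u ~ N(0, sigma2_g * K), drawn via a Cholesky factor
L <- t(chol(sigma2_g * K))
u <- as.numeric(L %*% rnorm(n))

# Assumption 4: independent normal residuals
eps <- rnorm(n, mean = 0, sd = sqrt(sigma2_e))

y <- beta * x + u + eps

# Generalized least squares with the marginal covariance V (here Z = I)
X <- cbind(intercept = 1, genotype = x)
V <- sigma2_g * K + sigma2_e * diag(n)
Vinv <- solve(V)
XtVinvX <- t(X) %*% Vinv %*% X
beta_hat <- solve(XtVinvX, t(X) %*% Vinv %*% y)
se <- sqrt(diag(solve(XtVinvX)))

cbind(estimate = as.numeric(beta_hat), std_error = se)
```

The point of the sketch is the weighting by `Vinv`: related individuals contribute partly redundant information, and the standard error reflects that instead of treating all 200 observations as independent.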


### **Conclusion**

The linear mixed model is an essential tool in statistical genetics, allowing for the estimation of genetic associations while accounting for population structure and relatedness between individuals. When samples are related or structured, it gives better-calibrated p-values and standard errors than a simple linear model that treats all individuals as independent, which matters especially for complex traits influenced by both genetic and environmental factors.

# Example

In [14]:
# Load necessary libraries
library(lme4)

# Simulate true mean and effect size
set.seed(1)
baseline <- 170  # Population mean of the trait (e.g., height in cm) when the genetic variant has no effect (Model 1)
theta_true <- 2  # True effect size of the genetic variant. This represents the change in height (in cm) associated with each additional minor allele (Model 2)
sd_y <- 1  # Standard deviation of the trait (e.g., variability in height measurement within the population)

# Simulate genotype and height values for more individuals
n_individuals <- 3  # Number of individuals
n_measurements <- 5  # Number of measurements per individual

# Generate genotype values: one value in {0, 1, 2} is drawn per measurement (row),
# so in this toy simulation genotype varies within an individual
genotype <- sample(c(0, 1, 2), size = n_individuals * n_measurements, replace = TRUE)

# Generate individual IDs: n_individuals individuals, each with n_measurements measurements
individuals <- factor(rep(1:n_individuals, each = n_measurements))

# Simulate height values for each individual, based on genotypes
height_values <- rnorm(length(genotype), mean = baseline + theta_true * genotype, sd = sd_y)

# Create data frame with the simulated data
data <- data.frame(individual = individuals, genotype = genotype, height = height_values)

data

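| individual `<fct>` | genotype `<dbl>` | height `<dbl>` |
|---|---|---|
| 1 | 0 | 170.5758 |
| 1 | 2 | 173.6946 |
| 1 | 0 | 171.5118 |
| 1 | 1 | 172.3898 |
| 1 | 0 | 169.3788 |
| 2 | 2 | 171.7853 |
| 2 | 2 | 175.1249 |
| 2 | 1 | 171.9551 |
| 2 | 1 | 171.9838 |
| 2 | 2 | 174.9438 |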


In [15]:
# Fit a Linear Mixed Model (LMM) with random intercept for individuals
model <- lmer(height ~ genotype + (1 | individual), data = data)

# Summary of the model
summary(model)


boundary (singular) fit: see help('isSingular')



Linear mixed model fit by REML ['lmerMod']
Formula: height ~ genotype + (1 | individual)
   Data: data

REML criterion at convergence: 39.3

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.45865 -0.33622  0.03095  0.65098  1.24730 

Random effects:
 Groups     Name        Variance Std.Dev.
 individual (Intercept) 0.0000   0.0000  
 Residual               0.8121   0.9012  
Number of obs: 15, groups:  individual, 3

Fixed effects:
            Estimate Std. Error t value
(Intercept) 170.5660     0.3447 494.778
genotype      1.7175     0.2725   6.302

Correlation of Fixed Effects:
         (Intr)
genotype -0.738
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
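
The singular fit and the zero variance estimate for `individual` in the summary above are consistent with how the data were simulated: `height` was generated from `baseline`, `genotype`, and independent noise only, so there is no true between-individual variance to estimate, and there are just 3 groups. The sketch below is one possible variant, not the original analysis, in which each individual has a single genotype and a simulated random intercept. The names `sd_individual`, `genotype_per_ind`, `intercept_per_ind`, `sim`, and `fit`, as well as the choice of 50 individuals and all numeric values, are assumptions made for illustration.

```r
library(lme4)

set.seed(2)
n_individuals <- 50    # more groups than above, so a group variance can be estimated
n_measurements <- 5
baseline <- 170
theta_true <- 2
sd_individual <- 1.5   # hypothetical SD of the per-individual random intercept
sd_y <- 1

# One genotype and one random intercept per individual
genotype_per_ind <- sample(c(0, 1, 2), size = n_individuals, replace = TRUE)
intercept_per_ind <- rnorm(n_individuals, mean = 0, sd = sd_individual)

# Repeat the individual-level values across that individual's measurements
individual <- factor(rep(seq_len(n_individuals), each = n_measurements))
genotype <- rep(genotype_per_ind, each = n_measurements)
u <- rep(intercept_per_ind, each = n_measurements)

# Trait = baseline + genotype effect + individual effect + measurement noise
height <- rnorm(length(genotype),
                mean = baseline + theta_true * genotype + u,
                sd = sd_y)
sim <- data.frame(individual = individual, genotype = genotype, height = height)

# Same model formula as above: random intercept for each individual
fit <- lmer(height ~ genotype + (1 | individual), data = sim)

fixef(fit)       # estimated intercept and genotype effect
VarCorr(fit)     # estimated between-individual and residual SDs
isSingular(fit)  # whether the fit sits on the boundary of the parameter space
```

Because genotype is now constant within an individual, the genotype effect is estimated from differences between individuals, which is the situation a random intercept is meant to handle.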

