# Random Effect

A random effect represents uncertainty about an effect by modeling it as a value drawn from a distribution, rather than treating it as a fixed, unknown constant.

# Graphical Summary

![random effect](./graphical_summary/Slide14.png)

# Key Formula

In single-marker linear regression, we can treat $\beta$ as a random effect rather than as the fixed, unknown constant discussed in [OLS](https://gaow.github.io/statgen-prerequisites/ordinary_least_squares.html):

$$
\mathbf{Y}=\mathbf{X}\beta+\boldsymbol{\epsilon}, \quad \beta \sim N(\beta_0, \sigma_0^2)
$$

- $\mathbf{Y}$ is the $N \times 1$ vector of trait values for $N$ individuals
- $\mathbf{X}$ is the $N \times 1$ genotype vector for a single variant across $N$ individuals
- $\beta$ is the scalar random effect, drawn from the distribution $N(\beta_0, \sigma_0^2)$ with mean $\beta_0$ and variance $\sigma_0^2$
- $\boldsymbol{\epsilon}$ is the $N \times 1$ vector of error terms for $N$ individuals, with $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$
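
One consequence of treating $\beta$ as random can be worked out directly. The sketch below assumes that $\beta$ is independent of $\boldsymbol{\epsilon}$ and that the errors are independent across individuals (assumptions not stated explicitly above), and integrates out $\beta$ to get the marginal distribution of the trait:

$$
\begin{aligned}
E[\mathbf{Y}] &= \mathbf{X}E[\beta] + E[\boldsymbol{\epsilon}] = \mathbf{X}\beta_0 \\
\mathrm{Cov}(\mathbf{Y}) &= \mathbf{X}\,\mathrm{Var}(\beta)\,\mathbf{X}^T + \mathrm{Cov}(\boldsymbol{\epsilon}) = \sigma_0^2\mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I} \\
\mathbf{Y} &\sim N\left(\mathbf{X}\beta_0,\ \sigma_0^2\mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}\right)
\end{aligned}
$$

With a fixed $\beta$, the covariance of $\mathbf{Y}$ would be just $\sigma^2\mathbf{I}$. Under these assumptions, the random effect adds the term $\sigma_0^2\mathbf{X}\mathbf{X}^T$, so the trait values of individuals become correlated through their genotypes.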

# Technical Details

Technical details here.

# Example

In this example, we assume that the first genetic variant is the true causal variant and that its true effect is drawn from the distribution $N(0,1)$. We simulate the trait values under this model, then fit OLS to each variant and obtain the summary statistics.

- Requirements:
  - [OLS](https://gaow.github.io/statgen-prerequisites/ordinary_least_squares.html)
  - [Summary Statistics](https://gaow.github.io/statgen-prerequisites/summary_statistics.html)

In [20]:
# Clear the environment
rm(list = ls())
set.seed(12)
# Define genotypes for 5 individuals at 3 variants
# These represent actual alleles at each position
# For example, Individual 1 has genotypes: CC, CT, AT
genotypes <- c(
 "CC", "CT", "AT",  # Individual 1
 "TT", "TT", "AA",  # Individual 2
 "CT", "CT", "AA",  # Individual 3
 "CC", "TT", "AA",  # Individual 4
 "CC", "CC", "TT"   # Individual 5
)
# Reshape into a matrix
N <- 5 # number of individuals
M <- 3 # number of variants
geno_matrix <- matrix(genotypes, nrow=N, ncol=M, byrow=TRUE)
rownames(geno_matrix) <- paste("Individual", 1:N)
colnames(geno_matrix) <- paste("Variant", 1:M)

alt_alleles <- c("T", "C", "T")

# Convert to a raw genotype matrix under the additive model
Xraw_additive <- matrix(0, nrow=N, ncol=M) # count the number of non-reference alleles

rownames(Xraw_additive) <- rownames(geno_matrix)
colnames(Xraw_additive) <- colnames(geno_matrix)

for (i in 1:N) {
  for (j in 1:M) {
    alleles <- strsplit(geno_matrix[i,j], "")[[1]]
    Xraw_additive[i,j] <- sum(alleles == alt_alleles[j])
  }
}
# Standardize each variant to mean 0 and standard deviation 1
X <- scale(Xraw_additive, center=TRUE, scale=TRUE)

Let's assume that the first variant is the true causal variant and that its $\beta$ is drawn from the normal distribution $N(0,1)$, while the error terms are drawn from a normal distribution with standard deviation 0.3, i.e. $N(0, 0.3^2)$. The observed trait values are then:

In [21]:
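# Draw the causal effect once from N(0, 1)
beta_1 <- rnorm(1, mean = 0, sd = 1)
# Draw independent errors with standard deviation 0.3
epsilon <- rnorm(N, mean = 0, sd = 0.3)
# Only the first variant contributes to the trait
Y <- X[, 1] * beta_1 + epsilon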
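
To illustrate what "random effect" means in this simulation, the following is a minimal, self-contained sketch that repeats the draw of $\beta$ many times instead of once. Under the assumption that $\beta$ and the errors are independent, the model implies that the covariance of the trait vector across replicates is $\sigma_0^2\mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}$. The genotype vector `x`, the replicate count `n_rep`, and all variable names here are hypothetical choices made for illustration, not part of the example above.

```r
# Illustrative sketch only: x, n_rep and the parameter values are assumptions
set.seed(1)

# A small standardized genotype vector (hypothetical additive allele counts)
x <- as.numeric(scale(c(0, 2, 1, 0, 0)))
N <- length(x)

beta_0  <- 0    # mean of the effect distribution
sigma_0 <- 1    # standard deviation of the effect distribution
sigma   <- 0.3  # standard deviation of the error term

# Each replicate draws a fresh beta and a fresh error vector
n_rep <- 20000
beta_draws <- rnorm(n_rep, mean = beta_0, sd = sigma_0)
Y_draws <- sapply(beta_draws, function(b) x * b + rnorm(N, mean = 0, sd = sigma))
# Y_draws is an N x n_rep matrix: one simulated trait vector per column

# Compare the covariance across replicates with the covariance implied by the model
empirical_cov   <- cov(t(Y_draws))
theoretical_cov <- sigma_0^2 * outer(x, x) + sigma^2 * diag(N)

# Largest absolute difference between the two matrices
max(abs(empirical_cov - theoretical_cov))
```

The largest difference should be small relative to the entries of `theoretical_cov` and should shrink as `n_rep` grows. The off-diagonal entries of both matrices are non-zero because every individual in a replicate shares the same draw of $\beta$.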


Next, let's fit a separate OLS model for each variant:

In [22]:
# OLS summary statistics for all variants
p_values <- numeric(M)
betas <- numeric(M)

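for (j in seq_len(M)) {
  SNP <- X[, j]
  # Regress the trait on one variant at a time
  model <- lm(Y ~ SNP)
  summary_model <- summary(model)

  # Row 2 of the coefficient table is the SNP term (row 1 is the intercept)
  betas[j] <- summary_model$coefficients[2, 1]    # column 1: estimate
  p_values[j] <- summary_model$coefficients[2, 4] # column 4: p-value
}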

# Create summary table
OLS_results <- data.frame(Variant = colnames(X), Beta = betas, P_Value = p_values)
OLS_results

| Variant | Beta | P_Value |
|---|---|---|
| Variant 1 | -1.588887 | 0.005496276 |
| Variant 2 | 0.886833 | 0.344599346 |
| Variant 3 | 1.04868 | 0.243084489 |
