# Intuition

Single marker linear regression in statistical genetics


# Notations

## Random Effect Model Specification

In the random effect model, we consider the effect size $\theta$ as a random variable drawn from a prior distribution:

$$\theta \sim N(\theta_0, \sigma_0^2)$$

The model becomes:

$$y_i = \mu + \theta x_{i,j} + \epsilon_i$$

Where:
- $\theta$ is a random variable
- $\epsilon_i \sim N(0, \sigma^2)$

### Bayesian Interpretation

From a Bayesian perspective, we start with a prior on the effect size:

$$p(\theta) = N(\theta_0, \sigma_0^2)$$

After observing data $D = \{(x_1,y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, we update to the posterior:

$$p(\theta|D) = N(\theta_1, \sigma_1^2)$$

Where:

$$\theta_1 = \frac{\tau_0 \theta_0 + \frac{1}{\sigma^2}\sum_{i=1}^N x_{i,j}(y_i - \mu)}{\tau_0 + \frac{1}{\sigma^2}\sum_{i=1}^N x_{i,j}^2}$$

$$\tau_1 = \tau_0 + \frac{\sum_{i=1}^N x_{i,j}^2}{\sigma^2}$$

$$\sigma_1^2 = \frac{1}{\tau_1}$$

Here, $\tau_0 = 1/\sigma_0^2$ is the prior precision, and $\tau_1 = 1/\sigma_1^2$ is the posterior precision.
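
The update can be sketched in a few lines of R. This is a minimal illustration, assuming the intercept $\mu$ and residual variance $\sigma^2$ are known; the function name `posterior_theta` and the toy data are made up for this example. It uses the standard conjugate normal update, in which the data terms are divided by $\sigma^2$, consistent with the expression for $\tau_1$.

```r
# Conjugate normal update for a single-marker effect (sketch).
# Assumes the intercept mu and the residual variance sigma2 are known.
posterior_theta <- function(x, y, mu, sigma2, theta0, sigma0_sq) {
  tau0 <- 1 / sigma0_sq                 # prior precision
  tau1 <- tau0 + sum(x^2) / sigma2      # posterior precision
  theta1 <- (tau0 * theta0 + sum(x * (y - mu)) / sigma2) / tau1
  list(mean = theta1, var = 1 / tau1)
}

# Hypothetical toy data: three individuals with genotypes 0, 1, 2
x <- c(0, 1, 2)
y <- c(170, 172, 174)
post <- posterior_theta(x, y, mu = 170, sigma2 = 1, theta0 = 0, sigma0_sq = 1)
post$mean  # 10 / 6, about 1.67 (OLS through the origin would give 2)
post$var   # 1 / 6
```

With only three observations, the prior pulls the posterior mean noticeably below the least-squares value.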

### Random Effect Parameter Estimation

The Best Linear Unbiased Predictor (BLUP) for the random effect is:

$$\tilde{\theta} = E[\theta|D] = \theta_1$$

With a zero prior mean ($\theta_0 = 0$), this can be rewritten as:

$$\tilde{\theta} = \frac{\sigma_0^2 \sum_{i=1}^N x_{i,j}(y_i - \mu)}{\sigma_0^2 \sum_{i=1}^N x_{i,j}^2 + \sigma^2}$$

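Or, in terms of the OLS estimator $\hat{\theta} = \sum_{i=1}^N x_{i,j}(y_i - \mu) \big/ \sum_{i=1}^N x_{i,j}^2$:

$$\tilde{\theta} = \frac{\sigma_0^2 \sum_{i=1}^N x_{i,j}^2}{\sigma_0^2 \sum_{i=1}^N x_{i,j}^2 + \sigma^2} \hat{\theta}$$

This shows that the BLUP is a shrunken version of the OLS estimator. The shrinkage factor lies between 0 and 1: it approaches 1 (little shrinkage) when $\sigma_0^2 \sum_{i=1}^N x_{i,j}^2$ is large relative to $\sigma^2$, and approaches 0 (strong shrinkage) when the prior variance $\sigma_0^2$ is small relative to $\sigma^2$.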
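
To make the shrinkage concrete, the following R sketch evaluates the shrinkage factor for a few prior variances. The helper name `shrinkage_factor`, the toy genotypes, and the chosen variances are illustrative assumptions, and the prior mean is taken to be zero.

```r
# Shrinkage of the OLS estimate toward a zero prior mean (sketch).
# sigma0_sq = prior variance of theta, sigma2 = residual variance.
shrinkage_factor <- function(x, sigma0_sq, sigma2) {
  sigma0_sq * sum(x^2) / (sigma0_sq * sum(x^2) + sigma2)
}

x <- c(0, 1, 2)             # hypothetical genotypes
y_centered <- c(0, 2, 4)    # hypothetical values of y - mu
theta_ols <- sum(x * y_centered) / sum(x^2)   # OLS through the origin: 2

# A tight prior shrinks strongly, a diffuse prior barely at all
priors <- c(0.01, 1, 100)
factors <- sapply(priors, function(s0) shrinkage_factor(x, s0, sigma2 = 1))
blup <- factors * theta_ols
data.frame(sigma0_sq = priors, factor = factors, blup = blup)
```

The factor grows monotonically with the prior variance, so the BLUP moves from near zero toward the OLS value of 2.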

### Variance Component Estimation

The variance components $\sigma_0^2$ and $\sigma^2$ are typically estimated using maximum likelihood (ML) or restricted maximum likelihood (REML). ML works with the marginal likelihood, in which the random effect $\theta$ is integrated out:

$$L(\sigma_0^2, \sigma^2|D) = \int p(D|\theta, \sigma^2)p(\theta|\sigma_0^2)d\theta$$

This can be maximized numerically to obtain estimates $\hat{\sigma}_0^2$ and $\hat{\sigma}^2$.
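
As a rough sketch of the numerical maximization, the R code below evaluates the marginal log-likelihood in closed form and passes it to `optim`. It assumes a zero prior mean and a known intercept, so that $y - \mu \sim N(0, \sigma_0^2 x x^\top + \sigma^2 I)$; the function name `marginal_loglik` and the simulated data are hypothetical. With a single marker there is only one draw of $\theta$, so the estimate of $\sigma_0^2$ is necessarily very noisy.

```r
# Marginal log-likelihood of (sigma0_sq, sigma2) with theta integrated out (sketch).
# Assumes theta ~ N(0, sigma0_sq) and a known intercept, so that
# r = y - mu ~ N(0, sigma0_sq * x x^T + sigma2 * I).
marginal_loglik <- function(log_par, x, r) {
  sigma0_sq <- exp(log_par[1])   # optimise on the log scale to keep variances positive
  sigma2 <- exp(log_par[2])
  n <- length(r)
  V <- sigma0_sq * tcrossprod(x) + sigma2 * diag(n)
  log_det <- as.numeric(determinant(V, logarithm = TRUE)$modulus)
  quad <- as.numeric(crossprod(r, solve(V, r)))
  -0.5 * (n * log(2 * pi) + log_det + quad)
}

set.seed(1)
n <- 200
x <- rbinom(n, size = 2, prob = 0.3)   # simulated genotypes (0/1/2)
r <- 0.5 * x + rnorm(n, sd = 1)        # simulated y - mu with theta = 0.5

# optim() minimises, so pass the negative log-likelihood (default Nelder-Mead)
fit <- optim(c(0, 0), function(p) -marginal_loglik(p, x, r))
est <- exp(fit$par)
names(est) <- c("sigma0_sq", "sigma2")
est
```

Optimizing over log-variances is a design choice that avoids boundary constraints; REML would additionally account for the degrees of freedom used by fixed effects, which this sketch does not do.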

### Comparison between Fixed and Random Effect Models

| Aspect | Fixed Effect (OLS) | Random Effect |
|--------|-------------------|---------------|
| Effect size interpretation | Fixed, unknown parameter | Random variable with a distribution |
| Estimation method | Ordinary Least Squares | Best Linear Unbiased Prediction (BLUP) |
| Shrinkage | No shrinkage | Shrinks estimates toward prior mean |
| Multiple testing | Requires correction | Can be integrated in the model |
| Power | Higher for large effects | Higher for small effects due to borrowing strength |
| Computational complexity | Lower | Higher |

## Percentage of Variance Explained in Random Effect Models

In a single marker random effect model, the percentage of variance explained (PVE) is the proportion of the phenotypic variance that is accounted for by the genetic variant.

Recall the random effect model for a single genetic variant $j$:

$$Y = \theta_b + \theta X_{\cdot,j} + \epsilon$$

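Where $\theta_b$ is the intercept (written $\mu$ above) and $\theta \sim N(\theta_0, \sigma^2_\theta)$ is treated as a random effect whose prior variance $\sigma^2_\theta$ was written $\sigma_0^2$ above.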

The percentage of variance explained by this variant can be calculated as:

$$\text{PVE}_j = \frac{\text{Var}(\theta X_{\cdot,j})}{\text{Var}(Y)}$$

Given that $\text{Var}(Y) = \text{Var}(\theta X_{\cdot,j}) + \sigma^2_\epsilon$, where $\sigma^2_\epsilon$ is the residual variance, we can express PVE as:

$$\text{PVE}_j = \frac{\text{Var}(\theta X_{\cdot,j})}{\text{Var}(\theta X_{\cdot,j}) + \sigma^2_\epsilon}$$

For a genetic variant whose random effect has zero prior mean ($\theta_0 = 0$) and is independent of the mean-centered genotype, the variance explained is:

$$\text{Var}(\theta X_{\cdot,j}) = \sigma^2_\theta \text{Var}(X_{\cdot,j})$$

For a bi-allelic variant coded as the minor allele count (0, 1, or 2) with minor allele frequency $f_j$, assuming Hardy-Weinberg equilibrium:

$$\text{Var}(X_{\cdot,j}) = 2f_j(1-f_j)$$

Therefore:

$$\text{PVE}_j = \frac{\sigma^2_\theta \times 2f_j(1-f_j)}{\sigma^2_\theta \times 2f_j(1-f_j) + \sigma^2_\epsilon}$$
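
The final expression is easy to evaluate directly. The R sketch below wraps it in a small helper; the name `pve_single` and the numeric inputs are illustrative assumptions, not values from any dataset.

```r
# PVE for a single variant under Hardy-Weinberg equilibrium (sketch).
# Argument names are illustrative: effect variance, allele frequency, residual variance.
pve_single <- function(sigma2_theta, f, sigma2_eps) {
  var_g <- sigma2_theta * 2 * f * (1 - f)   # variance contributed by the variant
  var_g / (var_g + sigma2_eps)
}

pve_common <- pve_single(sigma2_theta = 1, f = 0.5, sigma2_eps = 0.5)    # 0.5
pve_rare <- pve_single(sigma2_theta = 1, f = 0.05, sigma2_eps = 0.5)     # 0.095 / 0.595, about 0.16
c(common = pve_common, rare = pve_rare)
```

For the same effect variance, a rarer variant explains less phenotypic variance because its genotype variance $2f_j(1-f_j)$ is smaller.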



## Heritability

### Definition of Heritability

Heritability is a fundamental concept in genetics that quantifies the proportion of phenotypic variance attributable to genetic factors. There are two main types of heritability:

1. **Broad-sense heritability ($H^2$)**: The proportion of phenotypic variance that is due to all genetic factors, including additive, dominance, and epistatic effects.

$$H^2 = \frac{\sigma^2_G}{\sigma^2_P}$$

2. **Narrow-sense heritability ($h^2$)**: The proportion of phenotypic variance that is due only to additive genetic effects, which are the effects transmitted from parents to offspring.

$$h^2 = \frac{\sigma^2_A}{\sigma^2_P}$$

Where:
- $\sigma^2_G$ is the total genetic variance
- $\sigma^2_A$ is the additive genetic variance
- $\sigma^2_P$ is the total phenotypic variance

Narrow-sense heritability is particularly important in breeding and quantitative genetics as it represents the proportion of phenotypic variance that responds to selection.
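
As a small worked example of the two definitions, the R sketch below computes both heritabilities from hypothetical variance components. The dominance, epistatic, and environmental components and all numeric values are assumptions introduced purely for illustration, and the sketch assumes no gene-environment covariance.

```r
# Hypothetical variance components (arbitrary units)
sigma2_A <- 0.4   # additive
sigma2_D <- 0.1   # dominance
sigma2_I <- 0.0   # epistatic (interaction)
sigma2_E <- 0.5   # environmental / residual

sigma2_G <- sigma2_A + sigma2_D + sigma2_I   # total genetic variance
sigma2_P <- sigma2_G + sigma2_E              # total phenotypic variance

H2 <- sigma2_G / sigma2_P   # broad-sense heritability: 0.5
h2 <- sigma2_A / sigma2_P   # narrow-sense heritability: 0.4
c(H2 = H2, h2 = h2)
```

Because the additive variance is one part of the total genetic variance, $h^2$ can never exceed $H^2$.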



### Connection between PVE and Heritability

When we consider multiple genetic variants in a random effect model, the total percentage of variance explained across all variants relates to heritability. If the variants in the model capture all additive genetic effects and are mutually uncorrelated, and each $\text{PVE}_j$ is measured against the total phenotypic variance, then:

$$h^2 \approx \sum_{j=1}^{J} \text{PVE}_j$$

For a polygenic random effect model that includes all $J$ genetic variants:

$$Y = \theta_b + \sum_{j=1}^{J} \theta_j X_{\cdot,j} + \epsilon$$

With $\theta_j \sim N(0, \sigma^2_\theta)$ drawn independently for each variant, and the variants mutually uncorrelated (no linkage disequilibrium), the total additive genetic variance is:

$$\sigma^2_G = \sum_{j=1}^{J} \sigma^2_\theta \times 2f_j(1-f_j)$$

And the narrow-sense heritability becomes:

$$h^2 = \frac{\sigma^2_G}{\sigma^2_G + \sigma^2_\epsilon}$$

This is equivalent to the total percentage of variance explained by all variants.
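
The relationship can be checked numerically. The R sketch below uses simulated allele frequencies and assumed values for $\sigma^2_\theta$ and $\sigma^2_\epsilon$ (all hypothetical), and defines each variant's PVE against the total phenotypic variance so that the shares add up to $h^2$.

```r
set.seed(1)
J <- 1000                                # number of variants (illustrative)
f <- runif(J, min = 0.05, max = 0.5)     # simulated allele frequencies
sigma2_theta <- 0.001                    # assumed per-variant effect variance
sigma2_eps <- 0.5                        # assumed residual variance

# Variance contributed by each variant under Hardy-Weinberg equilibrium
var_per_variant <- sigma2_theta * 2 * f * (1 - f)
sigma2_G <- sum(var_per_variant)
h2 <- sigma2_G / (sigma2_G + sigma2_eps)

# Each variant's share of the total phenotypic variance; the shares sum to h2
pve_j <- var_per_variant / (sigma2_G + sigma2_eps)
c(sigma2_G = sigma2_G, h2 = h2, sum_pve = sum(pve_j))
```

The equality is exact here only because the variants are treated as uncorrelated; with linkage disequilibrium the per-variant contributions would overlap.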


## Estimating Heritability Using Random Effect Models

In practice, heritability can be estimated using mixed linear models that incorporate both fixed and random effects:

1. **Variance Component Methods**: Estimate the variance components ($\sigma^2_G$ and $\sigma^2_\epsilon$) using restricted maximum likelihood (REML) or Bayesian methods.

2. **Genomic REML (GREML)**: Uses a genomic relationship matrix (GRM) to estimate heritability from genome-wide SNP data.

3. **LD Score Regression**: Estimates heritability using summary statistics from genome-wide association studies (GWAS) by examining the relationship between test statistics and linkage disequilibrium (LD) scores.
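
To give a sense of the input that GREML works from, the following R sketch builds a genomic relationship matrix from simulated genotypes. It is only an illustration under simple assumptions: genotypes are simulated independently, each variant is standardized with its sample mean and standard deviation via `scale()`, and the GRM is the average cross-product over variants. Dedicated software may standardize differently (for example with allele-frequency-based variances), and the variance component fitting step itself is not shown.

```r
set.seed(1)
n <- 50                      # individuals (illustrative)
J <- 200                     # variants (illustrative)
f <- runif(J, 0.1, 0.5)      # simulated allele frequencies

# n x J matrix of allele counts, one column per variant
X <- sapply(f, function(p) rbinom(n, size = 2, prob = p))

# Drop monomorphic variants, which cannot be standardized
keep <- apply(X, 2, sd) > 0
Z <- scale(X[, keep, drop = FALSE])   # center and scale each variant

# Genomic relationship matrix: average cross-product over variants
A <- tcrossprod(Z) / ncol(Z)
dim(A)
mean(diag(A))   # equals (n - 1) / n because scale() uses the sample SD
```

Off-diagonal entries of `A` estimate pairwise genetic similarity; for these independently simulated individuals they scatter around zero.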


# Example

In [20]:
rm(list=ls())
set.seed(1)
# Simulate true mean and effect size
baseline <- 170  # Population mean of the trait (e.g., height in cm) when the genetic variant has no effect (Model 1)
theta_true <- 2  # True effect size of the genetic variant. This represents the change in height (in cm) associated with each additional minor allele (Model 2)
sd_y <- 1  # Standard deviation of the trait (e.g., variability in height measurement within the population)

# Fixed genotypes (minor allele counts) for three individuals
genotype <- c(1, 2, 0)

# Simulate height values for three individuals based on genotypes
n <- length(genotype)
height_values <- rnorm(n, mean = baseline + theta_true * genotype, sd = sd_y)
data <- data.frame(genotype = genotype, height = height_values)
data

| genotype `<dbl>` | height `<dbl>` |
|------------------|----------------|
| 1 | 171.3735 |
| 2 | 174.1836 |
| 0 | 169.1644 |


In [21]:
# Normalize genotype data (X)
X_normalized <- scale(data$genotype)

# Normalize height data (Y)
Y_normalized <- scale(data$height)

# Update the design matrix with normalized data
X <- cbind(1, X_normalized)  # Design matrix [1, X_.,j]
Y <- Y_normalized             # Trait vector

# Sample means of normalized data (should be 0 for normalized data)
mean_X <- mean(X_normalized)
mean_Y <- mean(Y_normalized)


In [22]:

# Parameter estimation using matrix form: solve the normal equations (X'X) b = X'Y
coef_hat <- solve(crossprod(X), crossprod(X, Y))
theta_b_hat <- coef_hat[1]  # Intercept estimate
theta_hat <- coef_hat[2]    # Effect size estimate

# Compute residuals
residuals <- Y - (theta_b_hat + theta_hat * X_normalized)

# Estimate residual variance (n - 2 degrees of freedom: intercept and slope)
if (n > 2) {
  sigma_squared_hat <- sum(residuals^2) / (n - 2)
} else {
  sigma_squared_hat <- var(as.numeric(residuals))  # Fallback when n <= 2
}

# Calculate variance of theta_hat (NA if all genotypes are identical)
ss_x <- sum((X_normalized - mean_X)^2)
var_theta_hat <- if (isTRUE(ss_x > 0)) sigma_squared_hat / ss_x else NA

# Log-likelihood for Model 1 (no genetic effect)
theta_b_hat_M1 <- mean_Y
residuals_M1 <- Y - theta_b_hat_M1
sigma_squared_hat_M1 <- sum(residuals_M1^2) / (n - 1)
l_M1 <- -n/2 * log(2 * pi * sigma_squared_hat_M1) - sum(residuals_M1^2)/(2 * sigma_squared_hat_M1)

# Log-likelihood for Model 2 (with genetic effect)
sigma_squared_hat_M2 <- sigma_squared_hat
l_M2 <- -n/2 * log(2 * pi * sigma_squared_hat_M2) - sum(residuals^2)/(2 * sigma_squared_hat_M2)

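# Likelihood ratio test
# Caveat: the plugged-in variances use divisors n - 1 and n - 2 rather than the ML
# divisor n, and n = 3 is far too small for the chi-square approximation to hold,
# so this p-value is illustrative only.
lr_test <- -2 * (l_M1 - l_M2)
lr_p_value <- pchisq(lr_test, df = 1, lower.tail = FALSE)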

# Print results
cat("Model Comparison:\n")
cat("Log-likelihood (M1, no genetic effect):", l_M1, "\n")
cat("Log-likelihood (M2, with genetic effect):", l_M2, "\n")
cat("Likelihood Ratio Test Statistic:", lr_test, "\n")
cat("p-value (LRT):", lr_p_value, "\n\n")


Model Comparison:
Log-likelihood (M1, no genetic effect): -3.756816 
Log-likelihood (M2, with genetic effect): 3.726255 
Likelihood Ratio Test Statistic: 14.96614 
p-value (LRT): 0.0001094577 



In [23]:
# Calculate Percentage of Variance Explained (PVE)
# ------------------------------------------------

# For the fixed effect model (OLS)
# PVE is equivalent to R^2 in this case
# Total sum of squares
SS_total <- sum((Y - mean(Y))^2)
# Residual sum of squares
SS_residual <- sum(residuals^2)
# Model sum of squares (explained variance)
SS_model <- SS_total - SS_residual
# Calculate R^2 (coefficient of determination)
R_squared <- SS_model / SS_total

# Calculate the PVE directly from the effect size for a single variant
# For normalized data, the variance of Y is 1
variance_explained_normalized <- (theta_hat^2 * var(X_normalized)) / var(Y_normalized)

# Going back to original scale
# Variance of Y in original scale
var_Y_original <- var(data$height)
# Variance of X in original scale
var_X_original <- var(data$genotype)
# Convert effect size back to original scale
theta_hat_original <- theta_hat * (sd(data$height) / sd(data$genotype))
# Calculate variance explained in original scale
variance_explained_original <- (theta_hat_original^2 * var_X_original) / var_Y_original

cat("\n--- Percentage of Variance Explained (PVE) ---\n")
cat("R-squared (normalized data):", R_squared, "\n")
cat("PVE (normalized data):", variance_explained_normalized, "\n")
cat("PVE (original scale):", variance_explained_original, "\n\n")


--- Percentage of Variance Explained (PVE) ---
R-squared (normalized data): 0.9952449 
PVE (normalized data): 0.9952449 
PVE (original scale): 0.9952449 



In [25]:
# For random effect interpretation
# -------------------------------

# Minor Allele Frequency (MAF) calculation
f_j <- mean(data$genotype) / 2  # Assuming genotype is coded as 0, 1, 2
# Expected genotype variance under Hardy-Weinberg Equilibrium
var_X_HWE <- 2 * f_j * (1 - f_j)
# Observed genotype variance
var_X_observed <- var(data$genotype)

# For a random effect model where θ ~ N(0, σ²_θ)
# Crude plug-in for σ²_θ: the squared fixed effect estimate, so that
# σ²_θ * Var(X) matches the variance explained (θ² * Var(X)) used for the PVE
sigma_squared_theta <- theta_hat_original^2

# Calculate heritability for a single variant model
# Heritability is the proportion of phenotypic variance explained by genetic factors
# For a single variant, this is:
h2_single_variant <- variance_explained_original

# In a polygenic model, heritability would be the sum of the PVE across all variants
# h² = σ²_G / (σ²_G + σ²_ε)
# where σ²_G is the genetic variance and σ²_ε is the environmental variance
# For a single variant:
genetic_variance <- sigma_squared_theta * var_X_observed
environmental_variance <- var_Y_original - genetic_variance
h2_estimated <- genetic_variance / (genetic_variance + environmental_variance)

cat("--- Heritability Estimation ---\n")
cat("Minor Allele Frequency (MAF):", f_j, "\n")
cat("Genotype variance (observed):", var_X_observed, "\n")
cat("Genotype variance (HWE expected):", var_X_HWE, "\n")
cat("Estimated effect variance (σ²_θ):", sigma_squared_theta, "\n")
cat("Genetic variance (σ²_G for this variant):", genetic_variance, "\n")
cat("Environmental variance (σ²_ε):", environmental_variance, "\n")
cat("Heritability from single variant (h²):", h2_estimated, "\n\n")


--- Heritability Estimation ---
Minor Allele Frequency (MAF): 0.5 
Genotype variance (observed): 1 
Genotype variance (HWE expected): 0.5 
Estimated effect variance (σ²_θ): 6.298273 
Genetic variance (σ²_G for this variant): 6.298273 
Environmental variance (σ²_ε): 0.0300923 
Heritability from single variant (h²): 0.9952449 

