# Intuition

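Single-marker linear regression tests one genetic variant at a time for association with a quantitative trait by regressing the trait on the genotype of that variant.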


# Notations

## Model Specification

In the fixed effect model for one genetic variant $X_{\cdot,j}$, we can express the relationship in matrix form:

$$
\mathbf{Y} = \theta_b \mathbf{1} + \theta X_{\cdot,j} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})
$$

Where:
- $\mathbf{Y}$ is the $N \times 1$ vector of trait values for all individuals
- $\mathbf{1}$ is an $N \times 1$ vector of ones
- $X_{\cdot,j}$ is the $N \times 1$ vector of genotypes for variant $j$ across all individuals
- $\theta_b$ is the intercept term
- $\theta$ is the effect size of the genetic variant
- $\boldsymbol{\epsilon}$ is the $N \times 1$ vector of error terms
- $\mathbf{I}$ is the $N \times N$ identity matrix

For individual $i$, this model can be written as:

$$
y_i = \theta_b + \theta x_{i,j} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)
$$
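To make the model concrete, the following is a minimal sketch that simulates data from it. The sample size, minor allele frequency, and parameter values are illustrative assumptions, not values from the text, and drawing genotypes as binomial allele counts is one common simplification rather than a requirement of the model.

```r
# Hypothetical simulation settings (assumptions for illustration only)
set.seed(42)
N       <- 500   # number of individuals
maf     <- 0.3   # assumed minor allele frequency
theta_b <- 170   # intercept
theta   <- 2     # effect of each additional minor allele
sigma   <- 1     # residual standard deviation

# Genotypes coded as minor allele counts (0, 1, or 2)
x <- rbinom(N, size = 2, prob = maf)

# Errors and trait values: y_i = theta_b + theta * x_i + eps_i
eps <- rnorm(N, mean = 0, sd = sigma)
y   <- theta_b + theta * x + eps

# Mean trait value within each genotype group
tapply(y, x, mean)
```

Under this model the group means should differ by roughly `theta` per additional allele, up to sampling noise.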

## Parameter Estimation

Using ordinary least squares (OLS), we can derive the estimators for $\theta_b$ and $\theta$ in matrix form.

Let $\mathbf{X} = [\mathbf{1} \; X_{\cdot,j}]$ be the $N \times 2$ design matrix and $\boldsymbol{\theta} = [\theta_b \; \theta]^T$ be the vector of parameters.

The OLS estimator is:

$$
\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}
$$

This gives us:

$$\hat{\theta}_b = \bar{Y} - \hat{\theta}\bar{X}_{\cdot,j}$$

$$\hat{\theta} = \frac{\sum_{i=1}^N (x_{i,j} - \bar{X}_{\cdot,j})(y_i - \bar{Y})}{\sum_{i=1}^N (x_{i,j} - \bar{X}_{\cdot,j})^2}$$

Where $\bar{Y}$ is the sample mean of the trait values and $\bar{X}_{\cdot,j}$ is the sample mean of the genotypes for variant $j$.
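As a quick check that the matrix form and the closed-form expressions agree, the sketch below computes both on simulated data and compares them with R's built-in `lm()`. The simulated data and its settings are assumptions made only for this illustration.

```r
# Hypothetical simulated data (settings are illustrative assumptions)
set.seed(42)
N <- 200
x <- rbinom(N, size = 2, prob = 0.3)   # genotype as allele count
y <- 170 + 2 * x + rnorm(N)            # trait values

# Matrix form: solve (X'X) theta = X'Y
X <- cbind(1, x)
theta_matrix <- as.numeric(solve(t(X) %*% X, t(X) %*% y))

# Closed form for the slope and intercept
theta_hat   <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
theta_b_hat <- mean(y) - theta_hat * mean(x)

# Reference fit with lm()
theta_lm <- unname(coef(lm(y ~ x)))

# One row per method: intercept, then slope
rbind(matrix = theta_matrix,
      closed = c(theta_b_hat, theta_hat),
      lm     = theta_lm)
```

All three rows should agree up to floating-point error.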

The variance-covariance matrix of the estimator is:

$$
\text{Var}(\hat{\boldsymbol{\theta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}
$$

The variance of the genetic effect estimator specifically is:

$$\text{Var}(\hat{\theta}) = \frac{\sigma^2}{\sum_{i=1}^N (x_{i,j} - \bar{X}_{\cdot,j})^2}$$
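The scalar form follows from the matrix form by inverting the $2 \times 2$ matrix $\mathbf{X}^T\mathbf{X}$ explicitly. A short derivation using only the definitions above:

$$
\mathbf{X}^T\mathbf{X} = \begin{bmatrix} N & \sum_{i=1}^N x_{i,j} \\ \sum_{i=1}^N x_{i,j} & \sum_{i=1}^N x_{i,j}^2 \end{bmatrix}, \quad \det(\mathbf{X}^T\mathbf{X}) = N\sum_{i=1}^N x_{i,j}^2 - \Big(\sum_{i=1}^N x_{i,j}\Big)^2 = N\sum_{i=1}^N (x_{i,j} - \bar{X}_{\cdot,j})^2
$$

$$
\left[(\mathbf{X}^T\mathbf{X})^{-1}\right]_{2,2} = \frac{N}{\det(\mathbf{X}^T\mathbf{X})} = \frac{1}{\sum_{i=1}^N (x_{i,j} - \bar{X}_{\cdot,j})^2}
$$

Multiplying this entry by $\sigma^2$ gives $\text{Var}(\hat{\theta})$.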

The residual variance can be estimated as:

$$\hat{\sigma}^2 = \frac{1}{N-2}(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\theta}})^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\theta}})$$

Or equivalently:

$$\hat{\sigma}^2 = \frac{1}{N-2}\sum_{i=1}^N (y_i - \hat{\theta}_b - \hat{\theta}x_{i,j})^2$$
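The residual variance and the variance of the effect estimate can be computed directly from these formulas. The sketch below does so on simulated data (an assumption for illustration) and compares the result with `vcov()` applied to an `lm()` fit.

```r
# Hypothetical simulated data (settings are illustrative assumptions)
set.seed(42)
N <- 200
x <- rbinom(N, size = 2, prob = 0.3)
y <- 170 + 2 * x + rnorm(N)

# OLS fit in matrix form
X         <- cbind(1, x)
coef_hat  <- solve(t(X) %*% X, t(X) %*% y)
residuals <- y - X %*% coef_hat

# Residual variance with N - 2 degrees of freedom
sigma2_hat <- sum(residuals^2) / (N - 2)

# Estimated variance of theta_hat: matrix form and scalar form
vcov_hat      <- sigma2_hat * solve(t(X) %*% X)
var_theta_hat <- sigma2_hat / sum((x - mean(x))^2)

# Reference values from lm()
fit <- lm(y ~ x)
c(matrix_form = vcov_hat[2, 2],
  scalar_form = var_theta_hat,
  lm          = vcov(fit)[2, 2])
```

The three values should match, since `vcov()` uses the same estimate $\hat{\sigma}^2(\mathbf{X}^T\mathbf{X})^{-1}$.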

## Hypothesis Testing

To test for association, we formulate the following hypotheses:

$H_0: \theta = 0$ (No association)  
$H_1: \theta \neq 0$ (Association exists)

## Statistical Significance

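The test statistic divides the estimate by its estimated standard error, obtained by replacing $\sigma^2$ with $\hat{\sigma}^2$ in $\text{Var}(\hat{\theta})$. Under the null hypothesis it follows a t-distribution with $N-2$ degrees of freedom:

$$t = \frac{\hat{\theta}}{\sqrt{\widehat{\text{Var}}(\hat{\theta})}} \sim t_{N-2}, \quad \widehat{\text{Var}}(\hat{\theta}) = \frac{\hat{\sigma}^2}{\sum_{i=1}^N (x_{i,j} - \bar{X}_{\cdot,j})^2}$$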

The p-value is calculated as:

$$p\text{-value} = 2 \times \text{Pr}(t_{N-2} > |t|)$$
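The test can be sketched in a few lines of R. The example below computes the t-statistic and two-sided p-value by hand on simulated data (an assumption for illustration) and compares them with the values reported by `summary()` of an `lm()` fit.

```r
# Hypothetical simulated data (settings are illustrative assumptions)
set.seed(42)
N <- 200
x <- rbinom(N, size = 2, prob = 0.3)
y <- 170 + 2 * x + rnorm(N)

# OLS estimates from the closed-form expressions
theta_hat   <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
theta_b_hat <- mean(y) - theta_hat * mean(x)

# Estimated standard error of theta_hat
sigma2_hat <- sum((y - theta_b_hat - theta_hat * x)^2) / (N - 2)
se_theta   <- sqrt(sigma2_hat / sum((x - mean(x))^2))

# t-statistic and two-sided p-value with N - 2 degrees of freedom
t_stat  <- theta_hat / se_theta
p_value <- 2 * pt(abs(t_stat), df = N - 2, lower.tail = FALSE)

# Reference values from lm()
ref <- summary(lm(y ~ x))$coefficients["x", ]
c(t_manual = t_stat, t_lm = ref[["t value"]],
  p_manual = p_value, p_lm = ref[["Pr(>|t|)"]])
```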

# Example

In [10]:
rm(list=ls())
set.seed(1)
# Simulate true mean and effect size
baseline <- 170  # Population mean of the trait (e.g., height in cm) when the genetic variant has no effect (Model 1)
theta_true <- 2  # True effect size of the genetic variant. This represents the change in height (in cm) associated with each additional minor allele (Model 2)
sd_y <- 1  # Standard deviation of the trait (e.g., variability in height measurement within the population)

# Genotypes (minor allele counts) for three individuals, fixed rather than simulated
genotype <- c(1, 2, 0)

# Simulate height values for three individuals based on genotypes
n <- length(genotype)
height_values <- rnorm(n, mean = baseline + theta_true * genotype, sd = sd_y)
data <- data.frame(genotype = genotype, height = height_values)
data

| genotype `<dbl>` | height `<dbl>` |
|---|---|
| 1 | 171.3735 |
| 2 | 174.1836 |
| 0 | 169.1644 |


In [11]:
# Normalize genotype data (X)
X_normalized <- scale(data$genotype)

# Normalize height data (Y)
Y_normalized <- scale(data$height)

# Update the design matrix with normalized data
X <- cbind(1, X_normalized)  # Design matrix [1, X_.,j]
Y <- Y_normalized             # Trait vector

# Sample means of normalized data (should be 0 for normalized data)
mean_X <- mean(X_normalized)
mean_Y <- mean(Y_normalized)

# Parameter estimation using matrix form: solve (X'X) theta = X'Y
coef_hat <- solve(t(X) %*% X, t(X) %*% Y)
theta_b_hat <- coef_hat[1]  # Intercept estimate
theta_hat <- coef_hat[2]    # Effect size estimate

# Compute residuals
residuals <- Y - (theta_b_hat + theta_hat * X_normalized)

# Estimate residual variance (defined only when n > 2, so that the n - 2 degrees of freedom are positive)
sigma_squared_hat <- if (n > 2) sum(residuals^2) / (n - 2) else NA_real_

# Calculate variance of theta_hat (undefined when all genotypes are identical)
ss_x <- sum((X_normalized - mean_X)^2)
var_theta_hat <- if (isTRUE(ss_x > 0)) sigma_squared_hat / ss_x else NA_real_

# Test statistic (NA propagates if the variance could not be calculated)
t_stat <- theta_hat / sqrt(var_theta_hat)

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Print results in regular summary statistics format
cat("\n--- Single Marker Linear Regression Results (Normalized Data) ---\n")
cat("\nEstimated Coefficients:\n")
cat(sprintf("Intercept (θ̂_b): %.4f\n", theta_b_hat))
cat(sprintf("Effect Size (θ̂): %.4f\n", theta_hat))

cat("\nResiduals:\n")
cat(sprintf("Residual Standard Deviation (σ̂): %.4f\n", sqrt(sigma_squared_hat)))

cat("\nHypothesis Testing:\n")
cat(sprintf("t-statistic: %.4f\n", t_stat))
cat(sprintf("p-value: %.4f\n", p_value))

cat("\nModel Fit:\n")
cat(sprintf("R-squared: %.4f\n", 1 - (sum(residuals^2) / sum((Y - mean_Y)^2))))
cat(sprintf("Residual Sum of Squares (RSS): %.4f\n", sum(residuals^2)))

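# Note: the true parameters are on the original scale (cm), while the estimates above are on the standardized scale, so the two are not directly comparable
cat("\nTrue Parameters (for reference):\n")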
cat(sprintf("True Intercept (θ_b true): %.4f\n", baseline))
cat(sprintf("True Effect Size (θ true): %.4f\n", theta_true))
cat(sprintf("True Standard Deviation (σ true): %.4f\n", sd_y))


# Create a confidence interval for the effect size
ci_lower <- theta_hat - qt(0.975, df = n - 2) * sqrt(var_theta_hat)
ci_upper <- theta_hat + qt(0.975, df = n - 2) * sqrt(var_theta_hat)
cat("95% Confidence Interval for θ:", c(ci_lower, ci_upper), "\n")


--- Single Marker Linear Regression Results (Normalized Data) ---

Estimated Coefficients:
Intercept (θ̂_b): 0.0000
Effect Size (θ̂): 0.9976

Residuals:
Residual Standard Deviation (σ̂): 0.0975

Hypothesis Testing:
t-statistic: 14.4672
p-value: 0.0439

Model Fit:
R-squared: 0.9952
Residual Sum of Squares (RSS): 0.0095

True Parameters (for reference):
True Intercept (θ_b true): 170.0000
True Effect Size (θ true): 2.0000
True Standard Deviation (σ true): 1.0000
95% Confidence Interval for θ: 0.1214306 1.873809 
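
As a cross-check, the same three observations can be fitted with R's built-in `lm()` on both the original and the standardized scale. This is a sketch that regenerates the data with the same seed as above; the object names are chosen here for illustration. Standardizing changes the slope estimate but not the t-statistic or p-value, because both are invariant to shifting and positively rescaling the trait and the genotype.

```r
# Regenerate the example data with the same seed and parameters
set.seed(1)
genotype <- c(1, 2, 0)
height   <- rnorm(length(genotype), mean = 170 + 2 * genotype, sd = 1)

# Standardized copies as plain numeric vectors
# (scale() returns a one-column matrix, so convert it)
z_genotype <- as.numeric(scale(genotype))
z_height   <- as.numeric(scale(height))

# Fit on the original scale (cm per allele) and on the standardized scale
fit_raw <- summary(lm(height ~ genotype))$coefficients
fit_std <- summary(lm(z_height ~ z_genotype))$coefficients

# Slope rows only: the estimates differ, the t value and p-value do not
rbind(original = fit_raw[2, ], standardized = fit_std[2, ])
```

The standardized row should reproduce the effect size, t-statistic, and p-value printed above (0.9976, 14.4672, and 0.0439), and the original-scale row should show the same t-statistic and p-value with a slope expressed in cm per allele.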
