# Intuition

Single marker linear regression in statistical genetics


![figure](./cartoons/2_1.svg)

# Notations

## Single variant (univariate regression model)

In single-marker linear regression, the trait is regressed on one genetic variant $j$ at a time. In matrix form:

$$
\mathbf{y} = \mathbf{X}_{j} \beta + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})
$$

Where:
- $\mathbf{y}$ is the $N \times 1$ vector of trait values for $N$ individuals (scaled)
- $\mathbf{X}_{j}$ is the $N \times 1$ vector of genotypes for variant $j$ across all individuals
- $\beta$ is the (scalar) effect size of genetic variant $j$
- $\boldsymbol{\epsilon}$ is the $N \times 1$ vector of error terms for $N$ individuals

For individual $i$, this model can be written as:

$$
y_i = \beta X_{i,j} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)
$$
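
To make the single-variant model concrete, the sketch below simulates a trait from it and recovers $\beta$ with `lm()`. The sample size, the allele frequency (0.3), the true effect `beta_true` and the noise level are illustrative assumptions, not values used elsewhere in this notebook.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
N <- 500                                     # assumed sample size
x_raw <- rbinom(N, size = 2, prob = 0.3)     # genotypes coded 0/1/2 (assumed allele frequency)
x <- as.numeric(scale(x_raw))                # standardized genotype vector X_j
beta_true <- 0.5                             # assumed true effect size
y <- beta_true * x + rnorm(N, sd = 1)        # y_i = beta * X_ij + eps_i

fit <- lm(y ~ x)                             # OLS fit with an intercept
beta_hat <- unname(coef(fit)["x"])

# In simple regression the OLS slope equals cov(x, y) / var(x)
beta_closed <- cov(x, y) / var(x)
print(c(lm = beta_hat, closed_form = beta_closed))
```

The two printed numbers should agree, and both should be close to `beta_true` up to sampling noise.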

## Multiple variants (multiple regression model)

In multiple-marker linear regression, all $M$ variants enter the model jointly. The matrix form is analogous:

$$
\mathbf{Y} = \mathbf{X} \boldsymbol{\beta}  + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim N(0, \sigma^2\mathbf{I})
$$

Where:
- $\mathbf{Y}$ is the $N \times 1$ vector of trait values for $N$ individuals
- $\mathbf{X}$ is the $N \times M$ matrix of genotypes for $M$ variants across $N$ individuals
- $\boldsymbol{\beta}$ is the $M \times 1$ vector of effect sizes for the $M$ genetic variants
- $\boldsymbol{\epsilon}$ is the $N \times 1$ vector of error terms for $N$ individuals


## Ordinary Least Squares (OLS)

Using Ordinary Least Squares (OLS), we can derive the estimators for $\boldsymbol{\beta}$ in matrix form:

$$
\hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}
$$
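
As a brief sketch of where this estimator comes from, assuming $\mathbf{X}^T\mathbf{X}$ is invertible: OLS minimizes the residual sum of squares, and setting its gradient with respect to $\boldsymbol{\beta}$ to zero gives the normal equations.

$$
\begin{aligned}
\text{RSS}(\boldsymbol{\beta}) &= (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) \\
\frac{\partial\, \text{RSS}}{\partial \boldsymbol{\beta}} &= -2\,\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0} \\
\mathbf{X}^T\mathbf{X}\,\hat{\boldsymbol{\beta}}_{\text{OLS}} &= \mathbf{X}^T\mathbf{Y}
\end{aligned}
$$

With a single variant and no intercept, $\mathbf{X}^T\mathbf{X}$ is the scalar $\mathbf{X}_j^T\mathbf{X}_j$, so the estimator reduces to $\hat{\beta} = \mathbf{X}_j^T\mathbf{y} \,/\, \mathbf{X}_j^T\mathbf{X}_j$.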

## GWAS

In a genome-wide association study (GWAS), there are millions of genetic variants (or more), far exceeding the number of individuals, i.e., $M \gg N$. In that case the joint model cannot be fitted by OLS: $\mathbf{X}^T\mathbf{X}$ is an $M \times M$ matrix of rank at most $N$, so it is singular and $(\mathbf{X}^T\mathbf{X})^{-1}$ does not exist. GWAS therefore tests each SNP separately using a **univariate regression model**, which is simple to fit and gives each SNP a directly interpretable marginal effect estimate. Modeling millions of SNPs jointly would require more parameters than there are observations, leading to overfitting. Testing each SNP separately produces one p-value per SNP, so multiple testing correction methods, such as Bonferroni or FDR, are applied to control false positives.
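
The singularity argument can be checked numerically. The sketch below uses toy dimensions (`N = 10`, `M = 50`, chosen only for illustration) and an unscaled random genotype matrix. It shows that the rank of $\mathbf{X}$, and hence of $\mathbf{X}^T\mathbf{X}$, cannot exceed $N$.

```r
set.seed(1)                      # arbitrary seed
N <- 10                          # assumed toy number of individuals
M <- 50                          # assumed toy number of variants, with M > N
X <- matrix(as.numeric(sample(0:2, N * M, replace = TRUE)), nrow = N, ncol = M)
y <- rnorm(N)

XtX <- crossprod(X)              # M x M matrix X^T X
rank_X <- qr(X)$rank             # rank(X^T X) = rank(X) <= min(N, M)
cat("dim(X^T X):", dim(XtX), " rank(X):", rank_X, "\n")

# Solving the joint normal equations is expected to fail on this rank-deficient system;
# tryCatch returns the error message instead of stopping the script.
joint_fit <- tryCatch(solve(XtX, crossprod(X, y)),
                      error = function(e) conditionMessage(e))
print(joint_fit)
```

Because the rank is bounded by `N`, at least `M - N` directions in $\boldsymbol{\beta}$ are not identified by the data, which is why the joint OLS fit is unavailable when $M > N$.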


# Example

In [1]:
rm(list=ls())
set.seed(21)  # For reproducibility
# Genotype matrix for 100 individuals and 3 variants
N <- 100  # Number of individuals
M <- 3    # Number of SNPs (variants)

# Create a random genotype matrix (0, 1, 2 values for each SNP)
X_raw <- matrix(sample(0:2, N * M, replace = TRUE), nrow = N, ncol = M)

# Adding row and column names
rownames(X_raw) <- paste("Individual", 1:N)
colnames(X_raw) <- paste("Variant", 1:M)
head(X_raw,3)

| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Individual 1 | 2 | 0 | 1 |
| Individual 2 | 0 | 2 | 1 |
| Individual 3 | 2 | 1 | 0 |


In [2]:
# Standardize each genotype column to mean 0 and standard deviation 1
X <- scale(X_raw, center = TRUE, scale = TRUE)
head(X)

| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Individual 1 | 1.3718249 | -1.4400798 | 0.06089224 |
| Individual 2 | -1.1685916 | 1.086376 | 0.06089224 |
| Individual 3 | 1.3718249 | -0.1768519 | -1.1569526 |
| Individual 4 | -1.1685916 | 1.086376 | -1.1569526 |
| Individual 5 | 0.1016167 | -0.1768519 | -1.1569526 |
| Individual 6 | 1.3718249 | 1.086376 | -1.1569526 |


In [3]:
# Simulate a trait from N(0, 1), independent of the genotypes (no true SNP effects)
y <- rnorm(N, mean = 0, sd = 1)
print("Standardized Response Vector y:")
y[1:10]

[1] "Standardized Response Vector y:"


In [4]:
# Perform GWAS-style analysis: Test each SNP independently using OLS
p_values <- numeric(M)  # Store p-values
betas <- numeric(M)     # Store estimated effect sizes

for (j in seq_len(M)) {
  SNP <- X[, j]  # Extract genotype for SNP j
  model <- lm(y ~ SNP)  # OLS regression: Trait ~ SNP
  summary_model <- summary(model)
  
  # Store p-value and effect size (coefficient)
  p_values[j] <- summary_model$coefficients[2, 4]  # p-value for SNP effect
  betas[j] <- summary_model$coefficients[2, 1]     # Estimated beta coefficient
}


In [5]:
# Create results table
gwas_results <- data.frame(Variant = colnames(X), Beta = betas, P_Value = p_values)
print("GWAS Results:")
print(gwas_results)

[1] "GWAS Results:"
    Variant        Beta   P_Value
1 Variant 1 -0.10376445 0.2865036
2 Variant 2  0.10904197 0.2625546
3 Variant 3 -0.07918321 0.4165944
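
As a small illustration of the multiple testing step, the sketch below applies Bonferroni and Benjamini-Hochberg (FDR) adjustments with base R's `p.adjust()`. The p-values are copied by hand from the results table above so that the block runs on its own; in the notebook itself the `p_values` vector from the loop could be passed directly.

```r
# P-values copied from the GWAS results table above
p_values <- c("Variant 1" = 0.2865036,
              "Variant 2" = 0.2625546,
              "Variant 3" = 0.4165944)

p_bonferroni <- p.adjust(p_values, method = "bonferroni")  # controls the family-wise error rate
p_bh <- p.adjust(p_values, method = "BH")                  # Benjamini-Hochberg FDR

print(data.frame(P_Value = p_values, Bonferroni = p_bonferroni, BH = p_bh))
```

None of the three variants is significant at the 0.05 level before or after correction, which is consistent with a trait simulated independently of the genotypes.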


In [6]:
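# OLS solution: estimate all SNP effects jointly using matrix algebra (no intercept)
# crossprod(X) is t(X) %*% X; solving the normal equations avoids forming the inverse explicitly
beta_hat_OLS <- solve(crossprod(X), crossprod(X, y))  # OLS formula without intercept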
print("OLS Solution (SNP effects without intercept):")
print(beta_hat_OLS)


[1] "OLS Solution (SNP effects without intercept):"
                 [,1]
Variant 1 -0.08316727
Variant 2  0.11533546
Variant 3 -0.08539224
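
The joint estimates differ slightly from the per-SNP estimates in cell [4] because the simulated SNPs are not exactly uncorrelated in a finite sample. For standardized genotypes the two are linked by $\hat{\boldsymbol{\beta}}_{\text{joint}} = \mathbf{R}^{-1}\hat{\boldsymbol{\beta}}_{\text{marginal}}$, where $\mathbf{R}$ is the sample correlation matrix of the SNPs. The sketch below mirrors the simulation steps above (same seed and dimensions) so that it runs on its own. Whether it reproduces the exact numbers printed above depends on the R version, so no expected output is shown.

```r
set.seed(21)                                   # same seed as the example above
N <- 100
M <- 3
X_raw <- matrix(sample(0:2, N * M, replace = TRUE), nrow = N, ncol = M)
X <- scale(X_raw)                              # centered columns with unit variance
y <- rnorm(N)

# Marginal (one-SNP-at-a-time) slopes, each from a regression with an intercept
beta_marginal <- sapply(seq_len(M), function(j) unname(coef(lm(y ~ X[, j]))[2]))

# Joint OLS without intercept, as in cell [6]
beta_joint <- as.numeric(solve(crossprod(X), crossprod(X, y)))

R <- cor(X)                                    # sample correlation between the SNPs
print(round(R, 3))

# Identity for standardized genotypes: beta_joint = R^{-1} %*% beta_marginal
print(cbind(marginal = beta_marginal,
            joint = beta_joint,
            R_inv_marginal = as.numeric(solve(R, beta_marginal))))
```

If the SNPs were exactly uncorrelated, $\mathbf{R}$ would be the identity matrix and the marginal and joint estimates would coincide.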
