# Intuition



![figure](./cartoons/3_1.svg)

# Notations

> slide 86-90 from GW
>
> slide 197-202 from GW


## Linear Regression Model

Consider the single variant linear regression model for a trait vector $ \mathbf{y} $ and a genetic variant $ \mathbf{X}_j $ across $ N $ individuals:

$$
\mathbf{y} = \mathbf{X}_j \beta_j + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim N(0, \sigma^2)
$$

Where:
- $ \mathbf{y} $ is the $ N \times 1 $ vector of trait values for $ N $ individuals.
- $ \mathbf{X}_j $ is the $ N \times 1 $ vector of genotypes for variant $ j $ across all individuals.
- $ \beta_j $ is the effect size of the genetic variant $ j $.
- $ \boldsymbol{\epsilon} $ is the $ N \times 1 $ vector of error terms (residuals), assumed to follow a normal distribution with mean 0 and variance $ \sigma^2 $.


## Estimation of $ \beta_j $

To estimate $ \beta_j $, we use the **Ordinary Least Squares (OLS)** method, which minimizes the residual sum of squares (RSS) between the observed and predicted trait values. Recall lecture 2.1 that we get the following formula for the OLS estimator of $ \beta_j $:

$$
\hat{\beta}_j = \frac{\mathbf{X}_j^T \mathbf{y}}{\mathbf{X}_j^T \mathbf{X}_j}
$$

## Variance of $ \hat{\beta}_j $

The variance of $ \hat{\beta}_j $ can be calculated as:

$$
\text{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\mathbf{X}_j^T \mathbf{X}_j}
$$

Where $ \sigma^2 $ is the residual variance, which is typically estimated from the residuals of the model.


## Example GWAS Summary Statistics Table

| SNP (rsID) | CHR | BP    | A1  | A2  | MAF  | BETA  | SE    | Z-score | P-value  | N     |
|------------|-----|-------|-----|-----|------|-------|-------|---------|----------|-------|
| rs12345    | 1   | 10583 | A   | G   | 0.12 | 0.045 | 0.010 | 4.50    | 1.2e-06  | 100000 |
| rs67890    | 2   | 20345 | C   | T   | 0.35 | -0.030 | 0.008 | -3.75   | 5.4e-04  | 95000  |
| rs54321    | 18   | 45678 | G   | A   | 0.22 | 0.060 | 0.012 | 5.00    | 2.1e-07  | 102000 |


- **SNP (rsID):** Identifier for the single nucleotide polymorphism, such as `rs12345` or `chr21:295472:G:A`.  
- **CHR:** Chromosome number where the SNP is located.  
  - *Other common names:* `chrom`, `chromosome`  
- **BP:** Genomic position of the SNP, specific to a reference genome assembly (e.g., GRCh37 or GRCh38).  
  - *Other common names:* `pos`, `position`  
- **A1 and A2:** The two alleles at the variant site, with one designated as the effect allele.  
  - *Other common names:* `REF and ALT`, `Effect_allele and Other_allele`  
- **MAF:** The frequency of the less common allele for variant $ j $ in the sample.
- **BETA:** Estimated effect size ($\hat{\theta}$), representing the association between the effect allele and the trait. Always check which allele it corresponds to.  
- **SE:** Standard error of the effect size estimat, reflecting the uncertainty in the effect size estimate. It is computed as $SE_j = \sqrt{\frac{\hat{\sigma}^2}{\mathbf{X}_j^T \mathbf{X}_j}}$, where $ \hat{\sigma}^2 $ is the estimated residual variance from the model.
- **Z:** Standardized test statistic, computed as $z = \frac{\beta}{\text{se}}$.  
- **P-value:** Significance of the association, testing whether the SNP has an effect on the trait, $p_j = 2 \times (1 - \Phi(|t_j|))$, where $ \Phi $ is the cumulative distribution function (CDF) of the standard normal distribution.
- **N:** Number of individuals included in the analysis.  
- **N_cases:** Number of individuals with the trait (cases), relevant for case-control studies.  
- **N_ctrls:** Number of individuals without the trait (controls), relevant for case-control studies.  


# Example

In [1]:
rm(list=ls())
set.seed(21)  # For reproducibility

# Genotype matrix for 100 individuals and 3 variants
N <- 100  # Number of individuals
M <- 3    # Number of SNPs (variants)

# Create a random genotype matrix (0, 1, 2 values for each SNP)
X_raw <- matrix(sample(0:2, N * M, replace = TRUE), nrow = N, ncol = M)

# Adding row and column names
rownames(X_raw) <- paste("Individual", 1:N)
colnames(X_raw) <- paste("Variant", 1:M)
# Standardize genotype matrix (mean = 0, sd = 1 for each SNP)
X <- scale(X_raw, scale = TRUE)

# Create a random response vector y (trait)
y <- rnorm(N, mean = 0, sd = 1)

# Perform GWAS-style analysis: Test each SNP independently using OLS
p_values <- numeric(M)  # Store p-values
betas <- numeric(M)     # Store estimated effect sizes
se_values <- numeric(M)  # Store standard errors
z_scores <- numeric(M)   # Store z-scores
maf_values <- numeric(M) # Store minor allele frequencies


In [2]:
for (j in 1:M) {
  SNP <- X[, j]  # Extract genotype for SNP j
  model <- lm(y ~ SNP)  # OLS regression: Trait ~ SNP
  summary_model <- summary(model)
  
  # Store p-value and effect size (coefficient)
  p_values[j] <- summary_model$coefficients[2, 4]  # p-value for SNP effect
  betas[j] <- summary_model$coefficients[2, 1]     # Estimated beta coefficient
  se_values[j] <- summary_model$coefficients[2, 2]  # Standard error
  
  # Calculate Z-score
  z_scores[j] <- betas[j] / se_values[j]
  
  # Calculate Minor Allele Frequency (MAF)
  # Assuming a MAF calculation based on allele counts in the genotype matrix
  maf_values[j] <- mean(X_raw[, j] == 1 | X_raw[, j] == 2) / 2  # Calculate MAF based on heterozygote and homozygote counts
}


In [3]:

# Combine the summary statistics into a data frame
summary_stats <- data.frame(
  SNP = colnames(X),
  BETA = betas,
  SE = se_values,
  Z_score = z_scores,
  P_value = p_values,
  MAF = maf_values
)

# Print the summary statistics
print("Summary Statistics for each SNP:")
print(summary_stats)


[1] "Summary Statistics for each SNP:"
        SNP        BETA         SE    Z_score   P_value   MAF
1 Variant 1 -0.10376445 0.09682566 -1.0716627 0.2865036 0.325
2 Variant 2  0.10904197 0.09676647  1.1268570 0.2625546 0.375
3 Variant 3 -0.07918321 0.09706234 -0.8157975 0.4165944 0.320


# Supplementary

## **Linear Mixed Models (LMM)**
LMMs account for relatedness between individuals by modeling both fixed and random effects. These models control for population structure and relatedness, reducing bias in association studies. Many new GWAS softwares includes REGENIE, BOLT-LMM, fastGWA and SAIGE. **[FIXME add references here]**

# TODO 

- [ ] what about LD here? Do we want to include fine-mapping?
> slide 206-207 from GW