# Intuition



summary statistics

# Notations

**GWAS summary statistics** summarize the association between each genetic variant and a trait. They are obtained by fitting a linear regression model to one variant $j$ at a time (either FEM or REM, see lecture 2):

$$
y_i = \beta_0 + \beta x_{i,j} + \epsilon_i
$$

where:
- $y_i$ is the trait value for individual $i$,
- $x_{i,j}$ is the genotype of individual $i$ at variant $j$,
- $\beta_0$ is the baseline level (intercept),
- $\beta$ is the effect size of the variant,
- $\epsilon_i \sim N(0, \sigma^2)$ represents the residual error.
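
As a rough sketch of what this regression computes, the slope and its standard error for a single variant have a closed form that can be checked against `lm()`. The sample size, allele frequency, and true coefficients below are assumptions chosen only for illustration.

```r
set.seed(2024)
n <- 1000
maf <- 0.3                              # assumed minor allele frequency
x <- rbinom(n, size = 2, prob = maf)    # genotypes coded as 0/1/2 minor alleles
y <- 1 + 0.2 * x + rnorm(n)             # assumed beta_0 = 1, beta = 0.2, sigma = 1

# Closed-form ordinary least squares for a single predictor
beta_hat   <- cov(x, y) / var(x)
beta0_hat  <- mean(y) - beta_hat * mean(x)
resid      <- y - beta0_hat - beta_hat * x
sigma2_hat <- sum(resid^2) / (n - 2)    # residual variance estimate
se_hat     <- sqrt(sigma2_hat / sum((x - mean(x))^2))

# Same quantities from lm()
fit <- summary(lm(y ~ x))$coefficients
c(beta_hat = beta_hat, se_hat = se_hat)
fit["x", c("Estimate", "Std. Error")]
```

The standard error shrinks as either the sample size or the genotype variance grows, which is why both `N` and `MAF` matter when reading the columns described below.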


## **Example GWAS Summary Statistics Table**

| SNP (rsID) | CHR | BP | A1 | A2 | MAF | BETA | SE | P-value | N |
|------------|-----|----|----|----|-----|------|----|--------|----|
| rs12345 | 1 | 10583 | A | G | 0.12 | 0.045 | 0.010 | 1.2e-06 | 100000 |
| rs67890 | 2 | 20345 | C | T | 0.35 | -0.030 | 0.008 | 5.4e-04 | 95000 |
| rs54321 | 3 | 45678 | G | A | 0.22 | 0.060 | 0.012 | 2.1e-07 | 102000 |


- **SNP (rsID):** Identifier for the single nucleotide polymorphism, such as `rs12345` or `chr21:295472:G:A`.  
- **CHR:** Chromosome number where the SNP is located.  
  - *Other common names:* `chrom`, `chromosome`  
- **BP:** Genomic position of the SNP, specific to a reference genome assembly (e.g., GRCh37 or GRCh38).  
  - *Other common names:* `pos`, `position`  
- **A1 and A2:** The two alleles at the variant site, with one designated as the effect allele.  
  - *Other common names:* `REF and ALT`, `Effect_allele and Other_allele`  
- **MAF:** Frequency of the minor allele in the dataset.  
- **BETA:** Estimated effect size ($\hat{\beta}$), the estimated change in the trait per additional copy of the effect allele. Always check which allele it corresponds to.  
- **SE:** Standard error of the effect size estimate.  
- **Z:** Standardized test statistic, computed as $Z = \frac{\text{BETA}}{\text{SE}}$.  
- **P-value:** Significance of the association, testing whether the SNP has an effect on the trait.  
- **N:** Number of individuals included in the analysis.  
- **N_cases:** Number of individuals with the trait (cases), relevant for case-control studies.  
- **N_ctrls:** Number of individuals without the trait (controls), relevant for case-control studies.  
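
Because column names vary between studies, a common first step is to map them onto one naming scheme. The sketch below is one possible way to do this in base R; the `raw` data frame, the `aliases` map, the `rsid` alias, and the helper `standardize_names()` are all hypothetical and not part of any package.

```r
# Hypothetical raw summary statistics with non-standard headers
raw <- data.frame(
  rsid          = c("rs12345", "rs67890"),
  chromosome    = c(1, 2),
  position      = c(10583, 20345),
  Effect_allele = c("A", "C"),
  Other_allele  = c("G", "T"),
  beta          = c(0.045, -0.030),
  se            = c(0.010, 0.008),
  stringsAsFactors = FALSE
)

# Standard name -> accepted aliases (lower case); illustrative, not exhaustive
aliases <- list(
  SNP  = c("snp", "rsid"),
  CHR  = c("chr", "chrom", "chromosome"),
  BP   = c("bp", "pos", "position"),
  A1   = c("a1", "effect_allele"),
  A2   = c("a2", "other_allele"),
  BETA = c("beta"),
  SE   = c("se")
)

# Rename a column only when exactly one header matches an alias
standardize_names <- function(df, aliases) {
  lower <- tolower(names(df))
  for (std in names(aliases)) {
    hit <- which(lower %in% aliases[[std]])
    if (length(hit) == 1) names(df)[hit] <- std
  }
  df
}

sumstats <- standardize_names(raw, aliases)
names(sumstats)
```

`REF` and `ALT` are deliberately left out of the map: whether the reference or the alternate allele is the effect allele depends on the software that produced the file, so that mapping should be confirmed by hand.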


## **Interpretation of Results**

When working with summary statistics, understanding the context and significance of each value is key to interpreting the results properly. Here are important aspects to consider when interpreting the results from a GWAS or genetic study:

1. **P-value and Significance**  
   - The **P-value** indicates the probability of observing an effect at least as extreme as the one found, assuming there is no true effect (null hypothesis).  
   - The most widely used threshold for **genome-wide significance** is **$5 \times 10^{-8}$**, which accounts for multiple testing due to the large number of SNPs tested. This is considered the cutoff for identifying variants strongly associated with the trait.  
   - **P-values below this threshold** indicate a significant association, while **P-values above it** suggest no strong evidence of association.  
   - **Suggestive association**: Some studies use a less stringent threshold, such as **$1 \times 10^{-5}$**, to identify variants that might be of interest for further investigation.

2. **Effect Size (BETA) and Direction**  
   - The **BETA** represents the estimated effect of the variant on the trait. A **positive BETA** means that the effect allele (A1) increases the trait value, while a **negative BETA** means it decreases the trait value.  
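   - The **size of BETA** reflects the magnitude of the effect: larger absolute values mean a larger change in the trait per copy of the effect allele. The strength of the evidence depends on BETA relative to its SE, not on BETA alone.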
   - It's crucial to verify which allele is the effect allele (A1 or A2) to understand the direction of the association properly.

3. **Standard Error (SE) and Precision**  
   - **SE** indicates the uncertainty of the effect size estimate. Smaller **SE** values mean the effect size is estimated more precisely.  
   - **Larger SE** can indicate low statistical power, potentially due to small sample sizes or low minor allele frequencies (MAF), leading to less reliable estimates.

4. **Z-score**  
   - The **Z-score** is calculated as $Z = \frac{\text{BETA}}{\text{SE}}$. It standardizes the effect size to give a measure of how many standard deviations the observed effect is from zero.  
   - A **larger absolute Z-score** indicates stronger evidence for association. $|Z| > 1.96$ corresponds to a two-sided P-value below 0.05, whereas genome-wide significance ($P < 5 \times 10^{-8}$) requires $|Z|$ of about 5.45.

5. **Minor Allele Frequency (MAF)**  
   - The **MAF** is the frequency of the minor allele in the population. **Common variants** (MAF > 5%) tend to have higher statistical power for detection because more individuals carry the minor allele, so the genotype varies more across the sample.  
   - **Rare variants** (MAF < 1%) often require larger sample sizes or more sophisticated analytical methods, as they are harder to detect and tend to have higher SE.

6. **Sample Size (N) and Statistical Power**  
   - A **larger sample size (N)** increases the **statistical power** of the study, reducing the standard error and increasing the likelihood of detecting true associations.  
   - When interpreting results, check the **sample size** as variants studied in small samples are more prone to false positives or false negatives due to insufficient power.

7. **Case-Control Studies (N_cases and N_ctrls)**  
   - For **case-control studies**, the distribution of **N_cases** and **N_ctrls** is important. If the cases (affected individuals) are underrepresented, it may decrease the power of the analysis and make it harder to detect associations.  
   - The case-control ratio matters as well as the total. Adding controls beyond a few per case yields diminishing gains in power, and a strongly unbalanced ratio can make standard test statistics unreliable for low-frequency variants.

8. **Genome Build and Variant Position (BP)**  
   - The **BP** (Base Pair Position) is specific to the reference genome assembly used (e.g., GRCh37, GRCh38).  
   - Always ensure that the **BP** corresponds to the same genome build across datasets when conducting meta-analyses or replication studies. Mismatches in reference builds could lead to inaccurate interpretations or difficulty in comparing results.
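
The relationships among BETA, SE, Z, and the significance thresholds in the list above can be sketched in a few lines of R. The BETA and SE values are taken from the example table; the P-values are recomputed under a normal approximation, which is an assumption here, so they need not equal the P-values a study reports from a different test.

```r
sumstats <- data.frame(
  SNP  = c("rs12345", "rs67890", "rs54321"),
  BETA = c(0.045, -0.030, 0.060),
  SE   = c(0.010, 0.008, 0.012)
)

# Z-score and two-sided P-value under a normal approximation
sumstats$Z <- sumstats$BETA / sumstats$SE
sumstats$P <- 2 * pnorm(-abs(sumstats$Z))

# Classify each variant against the two thresholds discussed above
sumstats$tier <- ifelse(sumstats$P < 5e-8, "genome-wide",
                 ifelse(sumstats$P < 1e-5, "suggestive", "not significant"))

# |Z| needed to reach genome-wide significance (two-sided)
z_gws <- qnorm(5e-8 / 2, lower.tail = FALSE)

sumstats
z_gws
```

Note that a variant can have a larger absolute BETA than another and still rank lower in significance if its SE is also larger.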


## **Marginal Effect Size and True Causal Effect**

The **BETA** reported in GWAS summary statistics represents the **marginal effect size** of a genetic variant on the trait. It is estimated by testing one variant at a time, without adjusting for other variants, so it also absorbs the effects of nearby variants that are in linkage disequilibrium (LD) with it. It does not account for pleiotropy (the variant affecting multiple traits) and can still be affected by confounding factors. Therefore, the BETA measures the association between the variant and the trait, but it does not necessarily represent the true **causal** effect of the variant.
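
The gap between a marginal association and a causal effect can be illustrated with a small simulation. In this sketch only `x_causal` affects the trait, while `x_tag` is a correlated, non-causal neighbor; the sample size, allele frequency, correlation, and effect size are assumptions made for illustration.

```r
set.seed(42)
n <- 5000

# Causal variant, coded as 0/1/2 minor alleles
x_causal <- rbinom(n, size = 2, prob = 0.3)

# Tag variant: copies the causal genotype 90% of the time, otherwise an
# independent draw, giving a correlation of roughly 0.9 between the two
copy  <- rbinom(n, size = 1, prob = 0.9) == 1
x_tag <- ifelse(copy, x_causal, rbinom(n, size = 2, prob = 0.3))

# Only the causal variant affects the trait (assumed true effect = 0.5)
y <- 0.5 * x_causal + rnorm(n)

# Marginal regression, one variant at a time (what a GWAS reports)
beta_marginal_tag <- coef(lm(y ~ x_tag))[["x_tag"]]

# Joint regression, both variants together
beta_joint <- coef(lm(y ~ x_causal + x_tag))

beta_marginal_tag   # clearly non-zero although x_tag is not causal
beta_joint          # x_tag shrinks toward zero once x_causal is included
```

The marginal estimate for the tag variant is roughly the causal effect scaled by the correlation between the two variants, which is the behavior that fine-mapping tries to untangle.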

### **Fine-Mapping to Identify Causal Variants**
To estimate the **true causal effect**, methods such as **fine-mapping** are employed. Fine-mapping aims to pinpoint the actual causal variants within a region of interest identified by GWAS. Given that GWAS typically identify broad regions with multiple variants, fine-mapping helps to prioritize variants by integrating evidence from:
1. **Statistical association** (GWAS results)
2. **Functional annotations** (e.g., chromatin interactions, gene expression)
3. **Biological understanding** (e.g., pathway analysis)

Fine-mapping methods include:
- **Colocalization analysis**: Identifies variants that co-localize with both gene expression and trait association signals.
- **Bayesian fine-mapping methods**: These methods, such as **SuSiE**, use probabilistic models to calculate the posterior probability that each variant in the identified region is causal (**[FIXME -- refer to the susie lecture]**).

By using fine-mapping techniques, researchers can obtain a more accurate understanding of the **causal variants** and their effects on the trait of interest, moving beyond the marginal associations detected by GWAS.

## **Summary**

One may wonder why fine-mapping is not always preferred over GWAS. GWAS summary statistics are valuable for understanding genetic associations with traits. While they provide marginal effect estimates, they are widely used due to their efficiency, broad applicability, and the ability to perform meta-analyses. However, for accurate causal inferences, fine-mapping techniques are needed to identify the true causal variants. There are several reasons for the continued use of GWAS summary statistics:

1. **No Individual Data Required**  
   - GWAS summary statistics can be used without individual-level data, avoiding privacy concerns.

2. **Smaller Data Size**  
   - Compared to full genotype data, summary statistics are more compact and manageable.

3. **Efficient First-Step Filtering**  
   - They enable quick filtering of candidate genetic variants, serving as a precursor for more detailed analyses.

4. **Broad Applicability**  
   - Summary statistics are useful across various traits and populations, and facilitate cross-trait genetic studies.

5. **Resource Efficiency**  
   - They support large-scale genetic analyses without the need for raw genotype data.

6. **Meta-Analysis Facilitation**  
   - Summary statistics are essential for combining data from multiple studies, increasing the power of association detection.

7. **Public Availability**  
   - Many repositories make GWAS summary statistics publicly available, promoting faster research progress without requiring access to raw genotype data.

8. **Caveat: Correctly Matched LD Required**  
   - This is a limitation rather than an advantage. Many downstream analyses of summary statistics also need a linkage disequilibrium (LD) reference, and it must match the population in which the GWAS was run. A mismatched LD reference can lead to inaccurate or biased conclusions.
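
To show how summary statistics alone support meta-analysis, here is a minimal sketch of one common approach, fixed-effect inverse-variance weighting, for a single variant. The two studies and their BETA and SE values are hypothetical, and both are assumed to report the effect for the same allele.

```r
# Hypothetical per-study estimates for the same variant and effect allele
beta <- c(study1 = 0.040, study2 = 0.050)
se   <- c(study1 = 0.010, study2 = 0.020)

# Inverse-variance weights: more precise studies count more
w <- 1 / se^2

beta_meta <- sum(w * beta) / sum(w)
se_meta   <- sqrt(1 / sum(w))
z_meta    <- beta_meta / se_meta
p_meta    <- 2 * pnorm(-abs(z_meta))   # two-sided, normal approximation

c(BETA = beta_meta, SE = se_meta, Z = z_meta, P = p_meta)
```

The combined SE is smaller than either study's SE, which is the gain in power that item 6 refers to. In practice the alleles must be harmonized across studies before the estimates are pooled.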


# Example

```r
rm(list = ls())
set.seed(1)

# Set the true baseline and effect size
baseline <- 170  # Population mean of the trait (e.g., height in cm) when the genetic variant has no effect (Model 1)
theta_true <- 2  # True effect size of the genetic variant. This represents the change in height (in cm) associated with each additional minor allele (Model 2)
sd_y <- 1  # Standard deviation of the trait (e.g., variability in height measurement within the population)

# Genotypes (number of minor alleles) for three individuals
genotype <- c(1, 2, 0)

# Simulate height values for the three individuals based on their genotypes
n <- length(genotype)
height_values <- rnorm(n, mean = baseline + theta_true * genotype, sd = sd_y)
data <- data.frame(genotype = genotype, height = height_values)
data
```

| genotype | height |
|----------|--------|
| 1 | 171.3735 |
| 2 | 174.1836 |
| 0 | 169.1644 |


```r
# Fit a linear regression of height on genotype (lm adds the intercept automatically)
lm_model <- lm(height ~ genotype, data = data)
coefs <- summary(lm_model)$coefficients  # rows: intercept, genotype

# Create summary statistics
summary_stats <- data.frame(
  SNP = "rs12345",        # Example SNP identifier
  CHR = 1,                # Example chromosome
  BP = 12345678,          # Example base pair position
  A1 = "A",               # Effect allele (minor allele)
  A2 = "G",               # Other allele (major allele)
  MAF = pmin(mean(data$genotype) / 2, 1 - mean(data$genotype) / 2),  # Minor allele frequency: mean allele count / 2, folded to be at most 0.5
  BETA = coefs[2, 1],     # Effect size estimate (slope of genotype in the regression model)
  SE = coefs[2, 2],       # Standard error of BETA
  Z = coefs[2, 1] / coefs[2, 2],  # Test statistic BETA / SE (a t statistic with n - 2 degrees of freedom here)
  P_value = coefs[2, 4],  # P-value for BETA (from the t distribution)
  N = n                   # Sample size
)
rownames(summary_stats) <- NULL

summary_stats
```

| SNP | CHR | BP | A1 | A2 | MAF | BETA | SE | Z | P_value | N |
|-----|-----|----|----|----|-----|------|----|---|---------|---|
| rs12345 | 1 | 12345678 | A | G | 0.5 | 2.509636 | 0.1734713 | 14.46715 | 0.04393462 | 3 |
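
The output above pairs a very large test statistic with a P-value of only about 0.04. The following sketch suggests why: with three individuals, `lm()` refers BETA / SE to a t distribution with $n - 2 = 1$ degree of freedom, which has much heavier tails than the normal distribution. The statistic is copied from the output above; treating it as a normal Z-score is shown only for contrast.

```r
t_stat <- 14.46715   # BETA / SE from the example output
n <- 3               # sample size in the example

# P-value as lm() computes it: t distribution with n - 2 degrees of freedom
p_t <- 2 * pt(-abs(t_stat), df = n - 2)

# P-value if the same statistic were treated as a normal Z-score
p_norm <- 2 * pnorm(-abs(t_stat))

c(p_t = p_t, p_norm = p_norm)
```

With GWAS-scale sample sizes the t and normal distributions are practically identical, which is why BETA / SE is usually reported and interpreted as a Z-score.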


# Extended reading material

## **Linear Mixed Models (LMM)**
LMMs account for relatedness between individuals by modeling both fixed and random effects. These models control for population structure and relatedness, reducing bias in association studies. Recent GWAS software packages that build on this idea include REGENIE, BOLT-LMM, fastGWA, and SAIGE. **[FIXME add references here]**

# TODO 

- [ ] what about LD here? Do we want to include fine-mapping?