# Collider

A collider is a variable that is **influenced by two other variables of interest**, creating a **spurious** association between them when we condition on (select or control for) the collider in our analysis.

# Graphical Summary

![Fig](./graphical_summary/slides/Slide15.png)

# Key Formula

A collider is defined by its position in a causal diagram rather than by an equation:

$$
X \rightarrow C \leftarrow Y
$$

Where:
- $C$ is the collider variable
- $X$ is one cause of the collider
- $Y$ is another cause of the collider
- The arrows (→) indicate the direction of causal influence

In this diagram, both $X$ (for example, an exposure) and $Y$ (for example, an outcome) point into $C$: the collider is *caused by* both of them.

As long as $C$ is left alone, the path $X \rightarrow C \leftarrow Y$ is blocked and contributes no association between $X$ and $Y$. Conditioning on $C$ (adjusting for it, stratifying by it, or selecting on it) opens the path and can induce a spurious association between its causes, even if they were originally independent.
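
A small simulation can make this concrete. The sketch below is illustrative only: the variable names, the unit effect sizes, and the stratum width of 0.25 are assumptions chosen for the demonstration. It generates two independent causes, builds a collider from them, and compares their correlation in the full sample with their correlation inside a narrow band of the collider.

```r
set.seed(1)
n <- 10000

x <- rnorm(n)                    # first cause
y <- rnorm(n)                    # second cause, generated independently of x
collider <- x + y + rnorm(n)     # collider: depends on both x and y (assumed unit effects)

# Marginal correlation: no conditioning on the collider
cor_marginal <- cor(x, y)

# Conditioning by stratification: keep only a narrow band of collider values
in_stratum <- abs(collider) < 0.25
cor_stratum <- cor(x[in_stratum], y[in_stratum])

c(marginal = cor_marginal, within_stratum = cor_stratum)
```

Under these assumed settings the marginal correlation should be near zero, while the within-stratum correlation should be clearly negative (the partial correlation of `x` and `y` given the collider is -0.5 in this model).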


# Technical Details

## What Happens When We Control for Colliders

When a collider is incorrectly controlled for, the observed association can be thought of schematically as:

$$
\text{Observed Association} = \text{True Effect} + \text{Collider Bias}
$$

- **True Effect**: The real biological relationship (may be zero)
- **Collider Bias**: The false association created by conditioning on the collider
- **Observed Association**: What we measure after incorrectly adjusting (often misleading!)

## The Problem: Conditioning on Colliders Creates Bias

Unlike confounders, colliders should **NOT** be included in regression models. Including a collider as a covariate can create spurious associations:

$$
Y = \beta_0 + \beta_1 X + \beta_2 \text{Collider} + \epsilon \quad \text{(WRONG!)}
$$

This regression generally gives a **biased estimate** of $\beta_1$ (the coefficient on $X$), even when the true effect of $X$ on $Y$ is zero.
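
The size of this bias can be worked out in a simple special case. The derivation below is a sketch under stated assumptions that are not part of the original text: $X$ and $Y$ are independent with variances $\sigma_X^2$ and $\sigma_Y^2$, $X$ has no effect on $Y$, and the collider is linear, $C = aX + bY + e$, with independent noise $e$ of variance $\sigma_e^2$. The symbols $a$, $b$, and $e$ are introduced here for illustration only. The population OLS coefficient on $X$ when $Y$ is regressed on $X$ and $C$ is

$$
\beta_1 = \frac{\mathrm{Cov}(X,Y)\,\mathrm{Var}(C) - \mathrm{Cov}(X,C)\,\mathrm{Cov}(Y,C)}{\mathrm{Var}(X)\,\mathrm{Var}(C) - \mathrm{Cov}(X,C)^2}
$$

Substituting $\mathrm{Cov}(X,Y) = 0$, $\mathrm{Cov}(X,C) = a\sigma_X^2$, $\mathrm{Cov}(Y,C) = b\sigma_Y^2$, and $\mathrm{Var}(C) = a^2\sigma_X^2 + b^2\sigma_Y^2 + \sigma_e^2$ gives

$$
\beta_1 = \frac{-ab\,\sigma_X^2\sigma_Y^2}{\sigma_X^2\left(b^2\sigma_Y^2 + \sigma_e^2\right)} = -\frac{ab\,\sigma_Y^2}{b^2\sigma_Y^2 + \sigma_e^2}
$$

In this special case the coefficient is nonzero whenever both $a$ and $b$ are nonzero, and its sign is opposite to the sign of $ab$: if both causes raise the collider, adjusting for it makes them appear negatively related.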

## Why This Happens: Selection Bias

Controlling for a collider creates **selection bias** by conditioning on a variable that depends on both exposure and outcome:

1. **Collider structure**: $X \rightarrow \text{Collider} \leftarrow Y$
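2. **Conditioning effect**: Holding the collider fixed restricts the analysis to combinations of $X$ and $Y$ that produce that collider value; for example, if both raise the collider, a high $X$ must be paired with a low $Y$
3. **Induced association**: This restriction creates an artificial association between $X$ and $Y$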

## Common Colliders in Genetic Studies

- Study Participation/Selection: Genetic Risk → Study Participation ← Disease Status
- Hospital Admission: Genetic Variant → Hospital Admission ← Disease Severity  
- Survival to Study Age: Protective Alleles → Survival ← Disease Resistance. Studying only elderly survivors can bias estimates of genetic effects on longevity.
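
The hospital admission scenario can be sketched as a simulation. Everything numeric here is a hypothetical assumption made for illustration: the allele frequency of 0.3, the effect sizes of 0.8 and 1.0, and the admission threshold of 1.5. Disease severity is generated independently of the variant, and admission depends on both.

```r
set.seed(2)
n <- 20000

variant <- rbinom(n, 2, 0.3)     # allele count (0/1/2), assumed allele frequency 0.3
severity <- rnorm(n)             # disease severity, generated independently of the variant

# Hypothetical admission rule: a liability driven by both causes plus noise
# must exceed a threshold for the individual to be admitted
liability <- 0.8 * variant + 1.0 * severity + rnorm(n)
admitted <- liability > 1.5

# Effect of the variant on severity in the whole population
beta_all <- coef(lm(severity ~ variant))["variant"]

# Same regression restricted to admitted patients (selection on the collider)
beta_admitted <- coef(lm(severity ~ variant, subset = admitted))["variant"]

c(all = unname(beta_all), admitted_only = unname(beta_admitted))
```

Restricting to admitted patients is a form of conditioning on the collider, so the second estimate is expected to be negative even though the variant has no effect on severity in the generating model.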


## The Key Principle

- **Confounders**: Control for them to remove bias
- **Colliders**: Do not control for them; doing so creates bias
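
The contrast between the two cases can be sketched side by side. This is a minimal simulation under assumed unit effect sizes; all variable names are hypothetical, and in both structures the true effect of the exposure on the outcome is zero.

```r
set.seed(3)
n <- 10000

# Confounder structure: z causes both x_conf and y_conf
z <- rnorm(n)
x_conf <- z + rnorm(n)
y_conf <- z + rnorm(n)
b_conf_unadjusted <- coef(lm(y_conf ~ x_conf))["x_conf"]       # biased (population value 0.5)
b_conf_adjusted <- coef(lm(y_conf ~ x_conf + z))["x_conf"]     # bias removed (population value 0)

# Collider structure: x_col and y_col are independent and both cause k
x_col <- rnorm(n)
y_col <- rnorm(n)
k <- x_col + y_col + rnorm(n)
b_col_unadjusted <- coef(lm(y_col ~ x_col))["x_col"]           # unbiased (population value 0)
b_col_adjusted <- coef(lm(y_col ~ x_col + k))["x_col"]         # bias created (population value -0.5)

round(c(conf_unadjusted = unname(b_conf_unadjusted),
        conf_adjusted = unname(b_conf_adjusted),
        col_unadjusted = unname(b_col_unadjusted),
        col_adjusted = unname(b_col_adjusted)), 3)
```

Adjustment moves the estimate toward the truth in the confounder case and away from it in the collider case, which is why the causal structure, not the statistical fit, should decide which covariates enter the model.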

# Example

This example demonstrates collider bias, a common source of spurious associations in genetic studies. In our simulated scenario, one genetic variant (Variant 3) affects BMI, and waist circumference also affects BMI, but no variant has any causal effect on waist circumference. BMI is therefore a collider: it is influenced by both the genotype and waist circumference. Adjusting for BMI creates a spurious association between the BMI-affecting variant and waist circumference, even though none exists in the data-generating model. This is why the causal relationships among variables must be considered before any of them are included as covariates in a statistical model.

We perform two analyses:

- A correct analysis that tests the association between each genetic variant and waist circumference
- An incorrect analysis that additionally controls for BMI (which creates bias)


Related topics:
- [OLS](https://gaow.github.io/statgen-prerequisites/ordinary_least_squares.html)
- [confounder](https://gaow.github.io/statgen-prerequisites/confounder.html)

Let's first generate the data: the genotypes, waist circumference (independent of the genotypes), and BMI.

In [59]:
# Clear the environment
rm(list = ls())
set.seed(15)  # For reproducibility

N <- 30  # Sample size
M <- 3   # Number of variants

# Generate genotypes as allele counts (0/1/2); only Variant 3 will affect BMI below
Xraw_additive <- matrix(sample(0:2, N*M, replace = TRUE, prob = c(0.3, 0.4, 0.3)), 
                        nrow = N, ncol = M)
rownames(Xraw_additive) <- paste("Individual", 1:N)
colnames(Xraw_additive) <- paste("Variant", 1:M)

X <- scale(Xraw_additive, center = TRUE, scale = TRUE)

# Generate waist circumference (independent of genetics)
waist_circumference <- rnorm(N, mean = 85, sd = 10)
waist_scaled <- scale(waist_circumference)

# Generate BMI as COLLIDER: caused by BOTH genetics AND waist circumference
genetic_effect_on_bmi <- 2 * Xraw_additive[, 3]        # Only Variant 3 affects BMI
waist_effect_on_bmi <- 0.15 * waist_circumference      # Waist circumference affects BMI

# BMI is the sum of genetic and waist effects plus noise
bmi_raw <- 25 + genetic_effect_on_bmi + waist_effect_on_bmi + rnorm(N, 0, 1)
bmi_scaled <- scale(bmi_raw)

Because waist circumference is generated independently of the genotypes, we expect no signal when we test for association between waist circumference and any SNP. We run two analyses:
- Ignore the collider (BMI): regress waist circumference on each SNP
- Adjust for the collider (BMI): regress waist circumference on each SNP and BMI

In [60]:
p_values_correct <- numeric(M)  # Store p-values for correct analysis
betas_correct <- numeric(M)     # Store estimated effect sizes for correct analysis
p_values_biased <- numeric(M)   # Store p-values for biased analysis (controlling for collider)
betas_biased <- numeric(M)      # Store estimated effect sizes for biased analysis

# Perform OLS regression for each SNP
for (j in 1:M) {
  SNP <- X[, j]  # Extract genotype for SNP j
  
  # CORRECT analysis: SNP vs waist circumference (no BMI control)
  correct_model <- lm(waist_scaled ~ SNP)  
  
  # INCORRECT analysis: Control for BMI (the collider)
  biased_model <- lm(waist_scaled ~ SNP + bmi_scaled)  
  
  summary_correct <- summary(correct_model)
  summary_biased <- summary(biased_model)
  
  # Store p-values and effect sizes
  p_values_correct[j] <- summary_correct$coefficients[2, 4]  # p-value for SNP effect (correct)
  betas_correct[j] <- summary_correct$coefficients[2, 1]     # Estimated beta coefficient (correct)
  p_values_biased[j] <- summary_biased$coefficients[2, 4]    # p-value for SNP effect (biased)
  betas_biased[j] <- summary_biased$coefficients[2, 1]       # Estimated beta coefficient (biased)
}

The results are:

In [61]:
# Create results table
results <- data.frame(
  Variant = paste("Variant", 1:M),
  Beta_Correct = round(betas_correct, 4),
  P_Value_Correct = round(p_values_correct, 4),
  Significant_Correct = ifelse(p_values_correct < 0.05, "Yes", "No"),
  Beta_Biased = round(betas_biased, 4),
  P_Value_Biased = round(p_values_biased, 4),
  Significant_Biased = ifelse(p_values_biased < 0.05, "Yes", "No")
)
results

| Variant | Beta_Correct | P_Value_Correct | Significant_Correct | Beta_Biased | P_Value_Biased | Significant_Biased |
|---|---|---|---|---|---|---|
| Variant 1 | 0.0645 | 0.7347 | No | 0.1841 | 0.2437 | No |
| Variant 2 | 0.2585 | 0.1677 | No | 0.0667 | 0.689 | No |
| Variant 3 | -0.0203 | 0.9154 | No | -0.7733 | 0.0 | Yes |


This example illustrates the dangers of collider bias in genetic association studies. The correct analysis finds no significant association between any variant and waist circumference, matching the data-generating model, in which waist circumference is independent of genotype. After adjusting for BMI (the collider), Variant 3 shows a strong, spurious association (standardized beta of -0.7733, with a p-value that rounds to 0 at four decimal places), while Variants 1 and 2 remain non-significant. Variant 3 is singled out because it is the only variant that affects BMI: among individuals with the same BMI, those carrying more copies of the BMI-raising allele must tend to have a smaller waist circumference, which produces the negative coefficient. In real-world studies, researchers must carefully consider the causal structure of their variables to avoid biased conclusions that could mislead scientific understanding and clinical decision-making.
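
A single dataset of 30 individuals could produce a significant result by chance, so it can help to repeat the simulation many times. The sketch below reuses the generating model of the example for Variant 3 only; the helper `simulate_once()` and the choice of 500 replicates are assumptions introduced here. It works on the raw rather than the scaled variables, which changes the coefficients but not the p-values.

```r
set.seed(15)

# Simulate one dataset from the example's generating model and return the
# Variant 3 p-value from the unadjusted and the BMI-adjusted regression
simulate_once <- function(n = 30) {
  variant3 <- sample(0:2, n, replace = TRUE, prob = c(0.3, 0.4, 0.3))
  waist <- rnorm(n, mean = 85, sd = 10)                       # independent of genotype
  bmi <- 25 + 2 * variant3 + 0.15 * waist + rnorm(n, 0, 1)    # collider
  p_unadjusted <- summary(lm(waist ~ variant3))$coefficients["variant3", 4]
  p_adjusted <- summary(lm(waist ~ variant3 + bmi))$coefficients["variant3", 4]
  c(unadjusted = p_unadjusted, adjusted = p_adjusted)
}

n_reps <- 500
p_values <- t(replicate(n_reps, simulate_once()))   # n_reps x 2 matrix of p-values

# Fraction of replicates declaring Variant 3 significant at the 0.05 level
rejection_rate <- colMeans(p_values < 0.05)
rejection_rate
```

If the unadjusted analysis is valid, its rejection rate should sit near the nominal 5% level, whereas the BMI-adjusted analysis is expected to reject in the large majority of replicates despite there being no true effect.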