# Intuition

collider


# Notations

## Collider

A **collider** is a variable that is caused by both the independent variable (here, a SNP) and the dependent variable (here, height). Conditioning on a collider, whether by adjusting for it or by restricting the analysis to one of its levels, can create a spurious association between the independent and dependent variables even when neither affects the other.

## Example in Statistical Genetics

Suppose we are studying the effect of a genetic variant (SNP) on a trait (e.g., height). **Study participation** can act as a collider between SNP and height.

- **SNP influences study participation**: Individuals carrying certain genetic variants may be more likely to participate in a study. For participation to be a collider in the sense used here, this influence is treated as a direct arrow from the SNP to participation, separate from any effect of the SNP on height.
- **Height influences study participation**: Taller individuals may be more likely to participate in a study about height due to an inherent interest or preexisting knowledge about their trait.

### Graphical Representation

$$ \text{SNP} \to \textbf{Study Participation} \leftarrow \text{Height} $$

## Why Study Participation is a Collider

1. **SNP influences study participation**: Individuals with specific SNPs may be more likely to participate in the study, independently of their height.
2. **Height influences study participation**: Taller people may have an interest in height-related genetics or may be more aware of their height as a characteristic, increasing their likelihood of participating in height-focused studies.
3. **Collider bias**: If we condition on study participation, either by adjusting for it or by analysing participants only, we condition on a variable that is influenced by both the SNP and height. This opens a non-causal path between the SNP and height through the collider (it is not a backdoor path, because no arrow points into the SNP), leading to a **spurious association** between the two.
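
The third point can be made concrete with a short derivation. The symbols $G$ (SNP genotype), $H$ (height), and $S$ (study participation, with $S = 1$ for participants) are introduced here for illustration only. Assume, as in the diagram, that the SNP and height are independent in the population:

$$ P(G, H) = P(G)\,P(H) $$

Applying Bayes' rule among participants gives

$$ P(G, H \mid S = 1) = \frac{P(S = 1 \mid G, H)\,P(G)\,P(H)}{P(S = 1)} $$

The right-hand side splits into a function of $G$ times a function of $H$ only if $P(S = 1 \mid G, H)$ does. When participation depends on both jointly, for example when either carrying the variant or being tall is enough to enrol, it generally does not split, so $G$ and $H$ are dependent among participants even though they are independent in the population.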


# Example

In [86]:
rm(list=ls())
set.seed(52)  # For reproducibility
library(dplyr)
# Genotype matrix for 100 individuals and 3 variants
N <- 100  # Number of individuals
M <- 3    # Number of SNPs (variants)

# Create genotype matrix (0, 1, 2 values for each SNP)
X_raw <- matrix(sample(0:2, N * M, replace = TRUE), nrow = N, ncol = M)

# Add row and column names
rownames(X_raw) <- paste("Individual", 1:N)
colnames(X_raw) <- paste("Variant", 1:M)

# Standardize genotype matrix (mean = 0, sd = 1 for each SNP)
X <- scale(X_raw, scale = TRUE)

# Generate height (y) without any causal effect from SNP to y
y <- rnorm(N, mean = 0, sd = 1)  # Random height, no effect from SNP


In [87]:
# Simulate study participation based on the SNP (second variant) and height (y).
# X is standardized, so X[, 2] >= 1 is expected to select the highest genotype
# value (2 copies of the variant) rather than every carrier.
carrier <- X[, 2] >= 1
tall    <- y > median(y)

# Probability of study participation. The weighted sum takes the values 0, 1.5, or 3,
# so it is capped at 1 to stay a valid probability: anyone who carries the variant
# or is taller than the median participates; everyone else does not.
prob_participation <- pmin(1, 1.5 * carrier + 1.5 * tall)

# Simulate participation (1 = participate, 0 = not participate)
study_participation <- ifelse(runif(N) < prob_participation, 1, 0)

# Create data frame with the results
data <- data.frame(
  Individual = 1:N,
  y = y,
  Study_Participation = factor(study_participation, levels = c(0, 1), labels = c("No", "Yes"))
)
data <- cbind(data, X)  # Add standardized genotype columns
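
Because the weights in the cell above add up to 0, 1.5, or 3, participation there is effectively all-or-nothing. A softer version, in which carriers and taller people are only *more likely* to participate, could be sketched as follows. The sample size, the carrier frequency of 0.3, the baseline log-odds of -1, and the names `carrier`, `height`, `p_participate`, and `participates` are assumptions for illustration, not part of the original analysis.

```r
set.seed(1)  # arbitrary seed for this sketch only

n       <- 20000                              # hypothetical sample size
carrier <- rbinom(n, size = 1, prob = 0.3)    # 1 = carries the variant (assumed frequency)
height  <- rnorm(n)                           # standardized height, independent of carrier
tall    <- height > median(height)

# Log-odds of participation: assumed baseline plus bumps for carriers and tall people
log_odds      <- -1 + 1.5 * carrier + 1.5 * tall
p_participate <- plogis(log_odds)             # inverse logit keeps every value inside (0, 1)

# Draw participation as a Bernoulli trial with individual-specific probability
participates <- rbinom(n, size = 1, prob = p_participate)

range(p_participate)                                          # all values strictly between 0 and 1
tapply(participates, list(carrier = carrier, tall = tall), mean)  # participation rate per group
```

Working on the log-odds scale is one simple way to guarantee valid probabilities however large the coefficients are. The participation rates should rise with both carrier status and height.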

In [88]:
# Filter data for participants only (Study_Participation == "Yes")
participants_data <- data %>% filter(Study_Participation == "Yes")
# Run linear regression on participants only
model_participants <- lm(y ~ `Variant 2`, data = participants_data)

# Display the summary of the regression model
summary(model_participants)



Call:
lm(formula = y ~ `Variant 2`, data = participants_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.77226 -0.51515 -0.07965  0.49138  1.75208 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.46638    0.09699   4.809 9.37e-06 ***
`Variant 2` -0.34349    0.09347  -3.675 0.000483 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7585 on 65 degrees of freedom
Multiple R-squared:  0.172,	Adjusted R-squared:  0.1593 
F-statistic: 13.51 on 1 and 65 DF,  p-value: 0.0004835



In this analysis, we examine the relationship between `Variant 2` (the standardized SNP genotype) and `y` (height) using data from the study participants only.

The regression shows that `Variant 2` is significantly associated with `y` among participants (p-value = 0.000483), with an estimated slope of -0.34349 per standard deviation of genotype. Taken at face value, this would suggest that the variant lowers height.


However, the significant relationship between `Variant 2` and height among the study participants is misleading: it is the result of **collider bias**. The **study participation** variable is influenced by both the SNP (mutation) and height, making it a collider between these two factors. This is a problem for two reasons:

- **Study Participation as a Collider**: Individuals with the SNP mutation (`Variant 2`) or those who are taller (`y > median(y)`) are selected into the study. Participation is therefore not random but depends on both the genetic variant and the trait being studied (height).

- **How the Collider Biases the Results**:
  - The regression is run only on individuals who participated in the study. By restricting to participants, we condition on a variable that is influenced by both the genotype and the phenotype (height).
  - This creates a spurious association between `Variant 2` and height because conditioning on participation opens a non-causal path from the SNP to height through the collider. Among participants, those without the mutation could only have entered the study by being tall, so they are taller on average than those with the mutation, which produces the negative slope.

In this simulation, `y` was generated independently of every SNP, so the true effect of `Variant 2` on height is zero. The estimated effect is therefore entirely **biased**: it arises from selecting participants on both genotype and height, not from a causal effect of the SNP on height.
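
One way to check this reading is to fit the same regression on everyone and on participants only, using a larger simulated sample so the estimates are stable. The sketch below is self-contained and is not the original notebook's code: the sample size, the seed, and the names `genotype`, `height`, `participates`, `fit_all`, and `fit_participants` are assumptions, and the selection rule (two copies of the variant, or taller than the median) is an assumed stand-in for the rule used in the example.

```r
set.seed(2024)  # arbitrary seed for this sketch only

n        <- 50000                            # hypothetical, large enough for stable estimates
genotype <- sample(0:2, n, replace = TRUE)   # SNP coded 0/1/2
height   <- rnorm(n)                         # generated independently, so the true effect is 0

# Assumed selection rule: two copies of the variant OR taller than the median
participates <- genotype == 2 | height > median(height)

fit_all          <- lm(height ~ genotype)                         # full sample
fit_participants <- lm(height ~ genotype, subset = participates)  # participants only

# Compare the two slope estimates side by side
round(c(full_sample  = unname(coef(fit_all)["genotype"]),
        participants = unname(coef(fit_participants)["genotype"])), 3)
```

Under these assumptions the full-sample slope should be close to zero, while the participants-only slope should be clearly negative, matching the direction of the estimate in the example.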
