# Intuition



# Notations

In statistical genetics, understanding the relationships between variables is crucial for accurate analysis. Confounders, colliders, and mediators play important roles in shaping these relationships, particularly in causal inference. Below, we explore these concepts with examples relevant to genetic studies.

## Confounder

A **confounder** is a variable that influences both the independent variable (genotype) and the dependent variable (trait), creating a spurious relationship between the two. If not controlled for, a confounder can distort the true relationship between the genotype and the trait.

- **Example in Statistical Genetics:**  
  Suppose we are studying the effect of a genetic variant (genotype) on a trait (e.g., height). If **population stratification** or **ancestry** influences both the genotype and height, it is a confounder. Failing to control for ancestry could lead to biased estimates of the genetic effect on the trait.

- **Graphical Representation:**
  $$ \text{Genotype} \leftarrow \textbf{Ancestry} \to \text{Height} $$  

Here, **ancestry** represents the confounder that affects both the genotype and the trait (height).

Confounding can cause bias in the estimation of the relationship between the genotype and the trait. To avoid this bias, confounders **must be controlled** in the analysis, typically by including them as covariates in regression models.
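
The effect of an uncontrolled confounder can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not part of the analysis below: the two ancestry groups, their allele frequencies (0.2 and 0.6), the height shift of 5 cm, and all variable names are hypothetical. The variant is given no effect on height, so any non-zero unadjusted estimate arises from confounding alone.

```r
set.seed(1)
n_conf <- 5000

# Hypothetical setup: two ancestry groups with different minor-allele frequencies
ancestry_group <- rbinom(n_conf, size = 1, prob = 0.5)        # group label 0 or 1
allele_freq <- ifelse(ancestry_group == 1, 0.6, 0.2)          # ancestry -> genotype
genotype_conf <- rbinom(n_conf, size = 2, prob = allele_freq) # minor-allele count

# Ancestry also shifts height directly; the variant itself has zero effect
height_conf <- 170 + 0 * genotype_conf + 5 * ancestry_group + rnorm(n_conf, sd = 1)

# Fit the model without and with the confounder as a covariate
fit_unadjusted <- lm(height_conf ~ genotype_conf)
fit_adjusted <- lm(height_conf ~ genotype_conf + ancestry_group)

beta_unadjusted <- unname(coef(fit_unadjusted)["genotype_conf"])
beta_adjusted <- unname(coef(fit_adjusted)["genotype_conf"])
print(c(unadjusted = beta_unadjusted, adjusted = beta_adjusted))
```

Under these assumptions the unadjusted slope should be clearly positive even though the variant has no effect, while the adjusted slope should be close to zero.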


## Collider

A **collider** is a variable that is influenced by both the independent variable (genotype) and the dependent variable (trait). Conditioning on a collider (by including it in the model) can create a spurious relationship between the genotype and the trait.

- **Example in Statistical Genetics:**  
  Consider studying the effect of a genetic variant (genotype) on a trait (e.g., height), where a third variable, **age**, is influenced by both the genotype and the trait. If we condition on **age**, we might introduce collider bias.

- **Graphical Representation:**
  $$ \text{Genotype} \to \textbf{Age} \leftarrow \text{Height} $$  
  Conditioning on **age** can create an association between the genotype and the trait that is not causal.

  In this case, **age** is a **collider** because it is influenced by both the genotype (via genetic predisposition to age-related traits) and height (through factors like growth patterns or age-related changes in height). Conditioning on **age** (e.g., including it as a covariate in the regression model) can introduce a spurious association between genotype and height that does not reflect a true causal effect.

Therefore, colliders like **age** should generally be avoided as covariates in statistical models to prevent introducing biased or misleading results.
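
Collider bias can be sketched in the same way. The following simulation is illustrative only: the variable `collider` is a hypothetical stand-in for a variable such as age, and the coefficients (1.0 and 0.5) are assumptions chosen for the demonstration. Genotype and height are generated independently, so the true genotype effect is zero.

```r
set.seed(2)
n_col <- 5000

# Genotype and trait are independent by construction (true effect = 0)
genotype_col <- rbinom(n_col, size = 2, prob = 0.3)
height_col <- rnorm(n_col, mean = 170, sd = 5)

# Hypothetical collider: influenced by both genotype and height
collider <- 1.0 * genotype_col + 0.5 * height_col + rnorm(n_col, sd = 1)

# Fit the model without and with the collider as a covariate
fit_without_collider <- lm(height_col ~ genotype_col)
fit_with_collider <- lm(height_col ~ genotype_col + collider)

beta_without <- unname(coef(fit_without_collider)["genotype_col"])
beta_with <- unname(coef(fit_with_collider)["genotype_col"])
print(c(without_collider = beta_without, with_collider = beta_with))
```

Here the estimate without the collider should be near zero, whereas conditioning on the collider should produce a markedly negative genotype coefficient despite the absence of any causal effect.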


## Mediator

A **mediator** is a variable that lies on the causal pathway between the independent variable (genotype) and the dependent variable (trait). It explains how or why the independent variable affects the dependent variable.

- **Example in Statistical Genetics:**  
  Suppose we are studying the effect of a genetic variant (genotype) on a trait (e.g., height). A potential **mediator** could be **BMI** (Body Mass Index), which mediates the relationship between the genetic variant and the trait. In this case, BMI helps explain how the genetic variant influences height.

- **Graphical Representation:**
  $$ \text{Genotype} \to \textbf{BMI} \to \text{Height} $$  
  Here, **BMI** is a mediator that explains the causal effect of the genotype on height.

Understanding mediation allows us to estimate both the **direct effect** of the genotype on the trait and the **indirect effect** through the mediator. Depending on the research question, mediators may or may not be included in the model.
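
The split into direct and indirect effects can be sketched with a linear simulation. All numbers and names here are assumptions for illustration: genotype raises a hypothetical `mediator` by 1.5 units per allele, the mediator raises height by 2.0 per unit, and the direct genotype effect is 1.0. By construction the indirect effect is 1.5 × 2.0 = 3 and the total effect is 1 + 3 = 4.

```r
set.seed(3)
n_med <- 5000

genotype_med <- rbinom(n_med, size = 2, prob = 0.3)

# genotype -> mediator (assumed effect: 1.5 per allele)
mediator <- 25 + 1.5 * genotype_med + rnorm(n_med, sd = 1)

# mediator -> height (assumed effect: 2.0) plus a direct genotype effect of 1.0
height_med <- 100 + 1.0 * genotype_med + 2.0 * mediator + rnorm(n_med, sd = 1)

# Omitting the mediator estimates the total effect;
# including it estimates the direct effect
fit_total <- lm(height_med ~ genotype_med)
fit_direct <- lm(height_med ~ genotype_med + mediator)

total_effect <- unname(coef(fit_total)["genotype_med"])
direct_effect <- unname(coef(fit_direct)["genotype_med"])
indirect_effect <- total_effect - direct_effect
print(c(total = total_effect, direct = direct_effect, indirect = indirect_effect))
```

Whether to report the total or the direct effect depends on the research question, which is why the mediator is included in one model and left out of the other.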


## Key Differences and Summary

| Type         | Definition | Direction of Influence | Impact on Relationship | Example in Genetics |
|--------------|------------|------------------------|------------------------|---------------------|
| **Confounder** | A variable that affects both genotype and trait, potentially distorting the observed relationship between them. | Ancestry $\to$ Genotype $\to$ Height<br>Ancestry $\to$ Height | Can create spurious associations between genotype and trait. Needs to be controlled for. | Ancestry affects both genotype and height. |
| **Collider**  | A variable that is influenced by both genotype and trait. Conditioning on a collider can create a spurious association. | Genotype $\to$ Age $\leftarrow$ Height | Can create false associations if conditioned on. | Age is influenced by both genotype and height. |
| **Mediator**  | A variable that lies on the causal pathway between genotype and trait and explains how genotype affects the trait. | Genotype $\to$ BMI $\to$ Height | Provides insight into the causal mechanism. | BMI mediates the relationship between genotype and height. |



## Practical Example in Statistical Genetics

Let’s consider a simple example of studying the effect of a genetic variant (genotype) on height using linear regression:

$$ \text{Height}_i = \beta_0 + \beta_1 \text{Genotype}_{ij} + \epsilon_i $$

Where:

- $\text{Height}_i$ is the height of individual $i$.
- $\text{Genotype}_{ij}$ is the genotype of individual $i$ at variant $j$ (e.g., the number of minor alleles).
- $\beta_0$ is the intercept: the expected height of an individual carrying zero minor alleles.
- $\beta_1$ is the effect size of the genetic variant: the expected change in height per additional minor allele.
- $\epsilon_i$ is the error term.


In this model, the relationship between the genotype and the trait could be confounded by variables like **ancestry**, mediated by variables like **BMI**, or distorted by conditioning on **colliders** like **age**.

By correctly identifying the role of each variable (adjusting for confounders, not conditioning on colliders, and including mediators only when the research question calls for it), we can obtain a more accurate estimate of the genetic effect on the trait.
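
Adjusting for covariates extends the simple regression above. As a sketch, with $C_{ik}$ denoting the value of covariate $k$ for individual $i$ and $\gamma_k$ its coefficient (both symbols are introduced here for illustration and do not appear elsewhere in this section), the adjusted model is:

$$ \text{Height}_i = \beta_0 + \beta_1 \text{Genotype}_{ij} + \sum_{k} \gamma_k C_{ik} + \epsilon_i $$

For a single covariate $C$ with coefficient $\gamma$, omitting $C$ from the model makes the least-squares slope on genotype converge to

$$ \beta_1 + \gamma \, \frac{\mathrm{Cov}(\text{Genotype}, C)}{\mathrm{Var}(\text{Genotype})} $$

rather than to $\beta_1$. The second term is the bias: it vanishes when $C$ is uncorrelated with genotype or has no effect on height ($\gamma = 0$), and it is non-zero when $C$ is a confounder.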



# Example

In [49]:
rm(list=ls())
set.seed(5)

# Set the simulation parameters
baseline <- 170  # Intercept of the simulation model: expected height (cm) when genotype and all covariates are 0
theta_true <- 2  # True effect size of the genetic variant: change in height (cm) per additional minor allele
sd_y <- 1  # Residual standard deviation of height (cm)

# Simulate genotypes
n <- 1000  # Number of individuals in the dataset
genotype <- sample(c(0, 1, 2), n, replace = TRUE)  # Minor-allele counts (0, 1, or 2), drawn with equal probability


In [50]:
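# Simulate the covariates
# NOTE: each covariate below is drawn independently of genotype, so the labels
# give the intended role from the sections above, not the simulated causal structure.
# 1. Ancestry (intended confounder): here it affects height only, not genotype
ancestry <- rnorm(n, mean = 0.3, sd = 0.1)

# 2. Age (intended collider): here it is a cause of height, not a consequence of genotype and height
age <- rnorm(n, mean = 40, sd = 10)

# 3. BMI (intended mediator): here it affects height but is not affected by genotype
bmi <- rnorm(n, mean = 25, sd = 4)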

In [51]:
# Simulate height values for the individuals with the following model:
# height = baseline + theta_true * genotype + 5 * ancestry - 0.3 * bmi + 0.1 * age + random noise
height_values <- rnorm(n, mean = baseline + theta_true * genotype + 5 * ancestry - 0.3 * bmi + 0.1 * age, sd = sd_y)

# Create a data frame with genotype, height, and covariates
data <- data.frame(
  genotype = genotype,
  height = height_values,
  ancestry = ancestry,
  bmi = bmi,
  age = age
)

head(data,3)

| | genotype | height | ancestry | bmi | age |
|---|----------|--------|----------|-----|-----|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 170.4351 | 0.4497451 | 27.07505 | 29.11596 |
| 2 | 2 | 173.7714 | 0.3343624 | 21.02533 | 45.59249 |
| 3 | 0 | 173.9096 | 0.3385473 | 14.93555 | 44.23577 |


In [52]:
# Perform linear regression without considering any covariates (Model 1)
model_no_covariates <- lm(height ~ genotype, data = data)

# Perform linear regression including covariates (Model 2)
model_with_covariates <- lm(height ~ genotype + ancestry + bmi + age, data = data)

# Summary of both models
print("================ not considering covariates ===========")
summary(model_no_covariates)

print("================ considering covariates ===========")
summary(model_with_covariates)




Call:
lm(formula = height ~ genotype, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4594 -1.3515  0.0065  1.4157  5.8707 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 168.0390     0.0973  1726.9   <2e-16 ***
genotype      1.9854     0.0755    26.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.956 on 998 degrees of freedom
Multiple R-squared:  0.4093,	Adjusted R-squared:  0.4087 
F-statistic: 691.5 on 1 and 998 DF,  p-value: < 2.2e-16





Call:
lm(formula = height ~ genotype + ancestry + bmi + age, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6532 -0.6696  0.0146  0.6191  3.4064 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 170.108434   0.267998  634.74   <2e-16 ***
genotype      1.958498   0.039692   49.34   <2e-16 ***
ancestry      4.816290   0.321812   14.97   <2e-16 ***
bmi          -0.304198   0.008319  -36.57   <2e-16 ***
age           0.102633   0.003233   31.74   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025 on 995 degrees of freedom
Multiple R-squared:  0.8382,	Adjusted R-squared:  0.8375 
F-statistic:  1288 on 4 and 995 DF,  p-value: < 2.2e-16


Including covariates (ancestry, BMI, and age) in the model improves both the model fit and the precision of the estimated effect of the genotype. Specifically:

- **Better Model Fit:** $R^2$ increases from 40.93% to 83.82%, and the residual standard error drops from 1.956 to 1.025, so the covariates explain a large share of the variation in height.
- **More Precise Effect Size:** The genotype estimate barely changes (1.99 versus 1.96, both close to the true value of 2), but its standard error falls from 0.0755 to 0.0397. The covariates absorb residual variance, which tightens the estimate.
- **What This Simulation Does Not Show:** Ancestry, BMI, and age are drawn independently of genotype here, so none of them acts as a confounder, collider, or mediator in the simulated data. That is why the unadjusted estimate is already approximately unbiased. When a covariate is truly a confounder, omitting it biases the estimate; when it is a collider or a mediator, adjusting for it can itself distort the estimate of the total effect.

Thus, in this example, adjusting for covariates that predict the trait yields a more precise estimate of the genotype effect. Whether adjustment also removes or introduces bias depends on the causal role of each covariate.
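
To compare the genotype coefficient across models without reading through two full summaries, the estimate, standard error, and confidence interval can be tabulated side by side. The helper `compare_term()` below is hypothetical (it is not part of base R or of this notebook), and the toy data are made up for the demonstration; only `lm()`, `summary()`, `coef()`, and `confint()` are standard R functions.

```r
# Hypothetical helper: tabulate one coefficient across several fitted lm models
compare_term <- function(models, term) {
  rows <- lapply(models, function(fit) {
    est <- coef(summary(fit))[term, c("Estimate", "Std. Error")]
    ci <- confint(fit, parm = term, level = 0.95)
    data.frame(
      estimate = est[["Estimate"]],
      std_error = est[["Std. Error"]],
      ci_lower = ci[1, 1],
      ci_upper = ci[1, 2]
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- names(models)
  out
}

# Toy data: x_toy predicts the outcome but is independent of g_toy
set.seed(4)
n_toy <- 500
g_toy <- sample(0:2, n_toy, replace = TRUE)
x_toy <- rnorm(n_toy)
y_toy <- 2 * g_toy + 3 * x_toy + rnorm(n_toy)

comparison <- compare_term(
  list(
    unadjusted = lm(y_toy ~ g_toy),
    adjusted = lm(y_toy ~ g_toy + x_toy)
  ),
  term = "g_toy"
)
print(comparison)
```

Applied to the fits above, the call would be `compare_term(list(no_covariates = model_no_covariates, with_covariates = model_with_covariates), term = "genotype")`.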
