# Confounder

A confounder is a variable that **causally influences both the exposure and the outcome**, creating a **misleading association** between them that does not reflect a true causal relationship.

# Graphical Summary

![Fig](./graphical_summary/slides/Slide14.png)

# Key Formula

The key formula for the concept of a confounder is represented in a causal diagram as:

$$
X \leftarrow C \rightarrow Y
$$

Where:
- $C$ is the confounder variable
- $X$ is the exposure/treatment variable 
- $Y$ is the outcome variable
- The arrows (←, →) indicate the direction of causal influence

This diagram illustrates that a confounder ($C$) has a direct causal effect on both the exposure ($X$) and the outcome ($Y$), creating a "backdoor path" between $X$ and $Y$ that must be blocked to obtain an unbiased estimate of the causal effect.
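
To make the backdoor path concrete, here is a minimal simulation sketch of the $X \leftarrow C \rightarrow Y$ structure. The sample size, the coefficients of 0.8, and the variable names are illustrative assumptions, not values from a real study. $X$ has no causal effect on $Y$ here, so any slope found by the naive regression comes entirely from the confounder.

```r
# Simulate X <- C -> Y with NO causal effect of X on Y (all numbers are illustrative)
set.seed(1)
n <- 5000
C <- rnorm(n)             # confounder
X <- 0.8 * C + rnorm(n)   # exposure depends on C only
Y <- 0.8 * C + rnorm(n)   # outcome depends on C only, not on X

naive_beta    <- unname(coef(lm(Y ~ X))["X"])      # backdoor path left open
adjusted_beta <- unname(coef(lm(Y ~ X + C))["X"])  # backdoor path blocked by conditioning on C

c(naive = naive_beta, adjusted = adjusted_beta)
```

Under these assumptions the naive slope should be close to $0.8 \times 0.8 / (0.8^2 + 1) \approx 0.39$, while the adjusted slope should be close to zero.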


# Technical Details



## What Happens When We Ignore Confounders

When a confounder is present but not controlled:

$$
\text{Observed Association} = \text{True Effect} + \text{Confounding Bias}
$$

- **True Effect**: The real biological relationship we want to find
- **Confounding Bias**: The false association created by the confounder
- **Observed Association**: What we actually measure (often misleading!)
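
For the linear case, this decomposition can be written out explicitly. The sketch below assumes a single confounder $C$, a linear model, and an error term $\epsilon$ that is uncorrelated with both $X$ and $C$. The symbol $\tilde{\beta}_1$, the slope obtained when $Y$ is regressed on $X$ alone, is introduced here only for illustration.

$$
Y = \beta_0 + \beta_1 X + \beta_2 C + \epsilon
\quad\Longrightarrow\quad
\tilde{\beta}_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \underbrace{\beta_1}_{\text{True Effect}} + \underbrace{\beta_2 \frac{\text{Cov}(X, C)}{\text{Var}(X)}}_{\text{Confounding Bias}}
$$

The bias term vanishes only if the confounder has no effect on the outcome ($\beta_2 = 0$) or is uncorrelated with the exposure ($\text{Cov}(X, C) = 0$).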


## The Solution: Control for Confounders

The most common and practical solution is **regression adjustment**, which includes the measured confounders as additional covariates in the model:

$$
Y = \beta_0 + \beta_1 X + \beta_2 C_1 + \beta_3 C_2 + \ldots + \epsilon
$$

Where:
- $Y$ = outcome (e.g., height, disease status)
- $X$ = genetic variant of interest  
- $C_1, C_2, \ldots$ = confounders (e.g., age, ancestry, sex)
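- $\beta_1$ = the effect of the genetic variant **adjusted for the confounders** (unbiased only if all relevant confounders are included and the linear model is appropriate)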

Here are the common approaches in genetic studies:
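- Principal Components (Most Common): Control for population structure by including the top PCs as covariates, alongside other confounders such as age and sex:
  $$
  Y = \beta_0 + \beta_1 X + \beta_2 \text{PC}_1 + \beta_3 \text{PC}_2 + \beta_4 \text{PC}_3 + \beta_5 \text{Age} + \beta_6 \text{Sex} + \epsilon
  $$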
- Linear Mixed Models: Use a genetic relationship matrix to account for complex population structure:
  $$
  \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\epsilon}
  $$
    where $\mathbf{u} \sim N(\mathbf{0}, \sigma^2 \mathbf{G})$ and $\mathbf{G}$ is the kinship matrix
- Stratified Analysis: Analyze each ancestry group separately, then combine results:
  1. Europeans: Trait ~ SNP + Age + Sex
  2. Asians: Trait ~ SNP + Age + Sex  
  3. Meta-analyze results
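
The three stratified steps above can be sketched as follows. This is a minimal illustration on simulated data: the group labels, allele frequencies, effect sizes, and the helper `simulate_group()` are all assumptions made for the example, and the combination step uses a fixed-effect inverse-variance-weighted meta-analysis, which is one common choice rather than the only one.

```r
# Simulated data: two ancestry groups with different allele frequencies and baseline trait means
set.seed(2)
n_per_group <- 2000
true_beta <- 0.2  # assumed per-allele effect, shared by both groups

# Hypothetical helper that simulates one ancestry group
simulate_group <- function(n, allele_freq, baseline) {
  snp <- rbinom(n, size = 2, prob = allele_freq)  # additive coding (0/1/2)
  age <- runif(n, 20, 70)
  sex <- rbinom(n, size = 1, prob = 0.5)
  trait <- baseline + true_beta * snp + 0.01 * age + 0.3 * sex + rnorm(n)
  data.frame(trait, snp, age, sex)
}

groups <- list(
  group_A = simulate_group(n_per_group, allele_freq = 0.2, baseline = 0),
  group_B = simulate_group(n_per_group, allele_freq = 0.6, baseline = 1)
)

# Steps 1-2: fit Trait ~ SNP + Age + Sex within each group
per_group <- t(sapply(groups, function(d) {
  coef(summary(lm(trait ~ snp + age + sex, data = d)))["snp", c("Estimate", "Std. Error")]
}))

# Step 3: fixed-effect inverse-variance-weighted meta-analysis
w <- 1 / per_group[, "Std. Error"]^2
meta_beta <- sum(w * per_group[, "Estimate"]) / sum(w)
meta_se <- sqrt(1 / sum(w))

# For comparison: pooled analysis that ignores ancestry
pooled <- do.call(rbind, groups)
pooled_beta <- unname(coef(lm(trait ~ snp + age + sex, data = pooled))["snp"])

c(meta = meta_beta, meta_se = meta_se, pooled_unadjusted = pooled_beta)
```

Because the two simulated groups differ in both allele frequency and baseline trait level, the pooled estimate that ignores ancestry should be inflated, while the stratified estimate should stay near the assumed effect of 0.2.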

The goal is to block **backdoor paths** while keeping the **direct causal path** open.
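
As one concrete way to block a backdoor path through population structure, the principal-component adjustment listed above can be sketched on simulated data. Everything here is an illustrative assumption: two hidden populations, 200 SNPs, a trait that differs by population with no causal SNP, and PCs computed with base R's `prcomp()` on the centered genotype matrix (real pipelines typically use dedicated tools and pruned, standardized genotypes).

```r
set.seed(3)
n <- 600   # individuals
m <- 200   # SNPs used to compute the PCs
pop <- rep(c(0, 1), each = n / 2)  # hidden population label, never given to the model

# Allele frequencies that drift apart between the two populations
freq_A <- runif(m, 0.1, 0.9)
freq_B <- pmin(pmax(freq_A + rnorm(m, sd = 0.15), 0.05), 0.95)
geno <- sapply(seq_len(m), function(j) {
  rbinom(n, size = 2, prob = ifelse(pop == 0, freq_A[j], freq_B[j]))
})

# Trait differs by population (e.g. environment), but no SNP has a causal effect
trait <- 2 * pop + rnorm(n)

# Top 3 principal components of the centered genotype matrix
pcs <- prcomp(geno, center = TRUE)$x[, 1:3]

# Test the SNP whose allele frequency differs most between the populations
snp <- geno[, which.max(abs(freq_A - freq_B))]
naive    <- coef(summary(lm(trait ~ snp)))["snp", c("Estimate", "Pr(>|t|)")]
adjusted <- coef(summary(lm(trait ~ snp + pcs)))["snp", c("Estimate", "Pr(>|t|)")]

rbind(naive, adjusted)
```

In this setup the first PC should track the hidden population label closely, so including the PCs should shrink the spurious SNP estimate toward zero without the model ever seeing `pop` directly.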

# Example

This example demonstrates how to identify and control for confounding in genetic association studies. We create a simple dataset with:

- Genotypes at 3 genetic variants for 5 individuals
- Height measurements
- Ancestry information (the confounder)

We perform two analyses:

- A naive analysis testing associations between genetic variants and height
- An adjusted analysis controlling for ancestry as a confounder

The example shows how ancestry can create spurious associations between certain genetic variants and height, and how proper statistical adjustment helps reveal the true underlying relationships by blocking this "backdoor path" created by the confounder.

Related topics:
- [OLS](https://gaow.github.io/statgen-prerequisites/ordinary_least_squares.html)

In [25]:
# Clear the environment
rm(list = ls())

# Define genotypes for 5 individuals at 3 variants
# These represent actual alleles at each position
# For example, Individual 1 has genotypes: CC, CT, AT
genotypes <- c(
 "CC", "CT", "AT",  # Individual 1
 "TT", "TT", "AA",  # Individual 2
 "CT", "CT", "AA",  # Individual 3
 "CC", "TT", "AA",  # Individual 4
 "CC", "CC", "TT"   # Individual 5
)
# Reshape into a matrix
N <- 5  # number of individuals
M <- 3  # number of variants
geno_matrix <- matrix(genotypes, nrow = N, ncol = M, byrow = TRUE)
rownames(geno_matrix) <- paste("Individual", 1:N)
colnames(geno_matrix) <- paste("Variant", 1:M)

alt_alleles <- c("T", "C", "T")

# Convert to a raw genotype matrix under the additive model
Xraw_additive <- matrix(0, nrow = N, ncol = M) # count the number of alternative (non-reference) alleles

rownames(Xraw_additive) <- rownames(geno_matrix)
colnames(Xraw_additive) <- colnames(geno_matrix)

for (i in 1:N) {
  for (j in 1:M) {
    alleles <- strsplit(geno_matrix[i,j], "")[[1]]
    Xraw_additive[i,j] <- sum(alleles == alt_alleles[j])
  }
}

X <- scale(Xraw_additive, center = TRUE, scale = TRUE)

Now let's introduce a confounder: Ancestry. We'll create two ancestry groups that correlate with both genotype and phenotype.

In [39]:
ancestry <- c(1, 2, 1, 2, 1) # 1=Population A, 2=Population B
ancestry_factor <- as.factor(ancestry)

We assign the trait information (height), the same as in previous notebooks:

In [44]:
# assign observed height for the 5 individuals
Y_raw <- c(180, 160, 158, 155, 193)
Y <- scale(Y_raw)
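
Before fitting any model, it can help to check that ancestry really is related to both the trait and the genotype in this toy dataset. The sketch below copies the values from the cells above so that it runs on its own. `variant3` is the additive coding of Variant 3 (count of T alleles) written out by hand, and the variable names `height_by_group` and `dosage_by_group` are introduced here for illustration.

```r
# Values copied from the cells above so the check is self-contained
ancestry <- c(1, 2, 1, 2, 1)            # 1 = Population A, 2 = Population B
Y_raw    <- c(180, 160, 158, 155, 193)  # observed height
variant3 <- c(1, 0, 0, 0, 2)            # count of T alleles at Variant 3

height_by_group <- tapply(Y_raw, ancestry, mean)     # mean height per ancestry group
dosage_by_group <- tapply(variant3, ancestry, mean)  # mean alt-allele count per ancestry group

height_by_group  # Population A: 177, Population B: 157.5
dosage_by_group  # Population A: 1,   Population B: 0
```

Population A is both taller on average and carries more T alleles at Variant 3, which is the pattern that opens a backdoor path between this variant and height.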

Then we run the OLS regression and see how the results differ after adjusting for ancestry.

In [42]:
p_values <- numeric(M)  # Store p-values
betas <- numeric(M)     # Store estimated effect sizes
p_values_adjusted <- numeric(M)  # Store p-values adjusted for ancestry
betas_adjusted <- numeric(M)     # Store estimated effect sizes adjusted for ancestry
# Perform OLS regression for each SNP
for (j in 1:M) {
  SNP <- X[, j]  # Extract genotype for SNP j
  model <- lm(Y ~ SNP)  # OLS regression: Trait ~ SNP
  adjusted_model <- lm(Y ~ SNP + ancestry_factor)  # Adjust for ancestry
  summary_model <- summary(model)
  summary_adjusted_model <- summary(adjusted_model)
  # Store p-value and effect size (coefficient)
  p_values[j] <- summary_model$coefficients[2, 4]  # p-value for SNP effect
  betas[j] <- summary_model$coefficients[2, 1]     # Estimated beta coefficient
  p_values_adjusted[j] <- summary_adjusted_model$coefficients[2, 4]  # p-value for SNP effect adjusted for ancestry
  betas_adjusted[j] <- summary_adjusted_model$coefficients[2, 1]     # Estimated beta coefficient adjusted for ancestry
}


In [43]:
# Create results table
results <- data.frame(Variant = colnames(X), Beta = betas, P_Value = p_values, 
                      Beta_Adjusted = betas_adjusted, P_Value_Adjusted = p_values_adjusted)
results

| Variant | Beta | P_Value | Beta_Adjusted | P_Value_Adjusted |
|---|---|---|---|---|
| Variant 1 | -0.5000913 | 0.390901513 | -0.2838356 | 0.66071613 |
| Variant 2 | 0.8525024 | 0.066475513 | 1.2137322 | 0.224494 |
| Variant 3 | 0.9866667 | 0.001844466 | 0.9461187 | 0.02057182 |


The results show how adjusting for a confounder changes the conclusions of a genetic association analysis. In the naive analysis without ancestry adjustment, Variant 3 appears highly significant ($p \approx 0.002$), while Variants 1 and 2 are non-significant at the 0.05 level ($p \approx 0.39$ and $p \approx 0.066$). After adjusting for ancestry, the evidence weakens for every variant: Variant 3's p-value rises roughly tenfold, from 0.0018 to 0.0206, so it remains nominally significant at 0.05 but would no longer pass a stricter threshold such as 0.01, and Variants 1 and 2 remain non-significant with larger p-values (0.66 and 0.22). The effect estimates also shift (for example, Variant 1's moves from $-0.50$ to $-0.28$). With only 5 individuals, part of the loss of significance reflects the degree of freedom spent on the ancestry term, so this toy example is illustrative rather than conclusive. In real studies, unadjusted population structure can create spurious associations, which is why ancestry adjustment is important for preventing false positive discoveries.