# Intuitional Description

A confounder is a variable that **influences both the exposure and outcome independently**, creating a **misleading association** between them that doesn't represent a true causal relationship.

# Graphical Summary

![confounder](./cartoons/confounder.svg)

# Key Formula

A confounder is summarized by the following causal diagram:
$$
X \leftarrow C \rightarrow Y
$$
Where:
- $C$ is the confounder variable
- $X$ is the exposure/treatment variable 
- $Y$ is the outcome variable
- The arrows (←, →) indicate the direction of causal influence

This diagram illustrates that a confounder ($C$) has a direct causal effect on both the exposure ($X$) and the outcome ($Y$), creating a "backdoor path" between $X$ and $Y$ that must be blocked to obtain an unbiased estimate of the causal effect.
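
As a minimal sketch of this structure, the R simulation below generates data in which $C$ drives both $X$ and $Y$ while $X$ has no effect on $Y$ at all. The effect sizes (0.8), the sample size, and the seed are assumptions chosen purely for illustration.

```r
# Simulate the structure X <- C -> Y with NO direct effect of X on Y.
# All effect sizes below are arbitrary illustrative choices.
set.seed(1)
n <- 10000
C <- rnorm(n)               # confounder
X <- 0.8 * C + rnorm(n)     # exposure depends on C only
Y <- 0.8 * C + rnorm(n)     # outcome depends on C only (true effect of X is 0)

# Marginal association: X and Y are correlated through the backdoor path
cor_marginal <- cor(X, Y)

# Association after removing the part of X and Y that is explained by C
cor_partial <- cor(resid(lm(X ~ C)), resid(lm(Y ~ C)))

cat("Marginal correlation:", round(cor_marginal, 3), "\n")
cat("Partial correlation given C:", round(cor_partial, 3), "\n")
```

Under these assumptions the marginal correlation is $0.64 / 1.64 \approx 0.39$ in expectation, while the correlation after accounting for $C$ should be close to zero.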


# Technical Details



Confounding can bias the estimated relationship between the exposure and the outcome (in genetic association studies, between the genotype and the trait). To avoid this bias, confounders **must be controlled** in the analysis, typically by including them as covariates in regression models.
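
To see where the bias comes from, suppose (as a simplifying assumption for illustration) that the outcome follows the linear model $Y = \beta_0 + \beta_1 X + \beta_2 C + \epsilon$, with $\epsilon$ uncorrelated with both $X$ and $C$. Regressing $Y$ on $X$ alone then estimates, in large samples:
$$
\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} = \beta_1 + \beta_2 \, \frac{\mathrm{Cov}(X, C)}{\mathrm{Var}(X)}
$$
The second term is the confounding bias. It vanishes only if $C$ has no effect on $Y$ ($\beta_2 = 0$) or $C$ is uncorrelated with $X$ ($\mathrm{Cov}(X, C) = 0$).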


In formal causal inference terminology, a confounder creates a situation where:
$$
P(Y|X) \neq P(Y|\text{do}(X))
$$
where $\text{do}(X)$ represents an intervention that sets $X$ to a specific value. The inequality states that the observed association differs from the causal effect because of the confounding variable.

Common strategies for handling confounders include:
1. Stratification: analyzing the $X$-$Y$ relationship separately within strata of $C$
2. Regression adjustment: $Y = \beta_0 + \beta_1 X + \beta_2 C + \epsilon$
3. Propensity score methods: creating balanced groups based on $P(X=1 \mid C)$
4. Instrumental variables: using a variable $Z$ that affects $Y$ only through $X$ ($Z \rightarrow X \rightarrow Y$) and is independent of the confounder ($Z \perp \!\!\! \perp C$)
5. Directed Acyclic Graphs (DAGs): identifying minimal sufficient adjustment sets
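
The first three strategies above can be sketched on simulated data. Everything in this block is an assumption made for illustration: a binary confounder, a binary exposure, a true causal effect of 1, and inverse probability weighting as one possible propensity score method.

```r
# Simulated data: binary confounder C, binary exposure X, continuous outcome Y.
# The true causal effect of X on Y is set to 1; all other numbers are arbitrary.
set.seed(42)
n <- 20000
C <- rbinom(n, 1, 0.5)
X <- rbinom(n, 1, ifelse(C == 1, 0.8, 0.2))  # C raises the chance of exposure
Y <- 1 * X + 2 * C + rnorm(n)                # C also raises the outcome

# Naive estimate: ignores C and is biased upward
naive <- unname(coef(lm(Y ~ X))["X"])

# 1. Stratification: estimate the effect within each stratum of C,
#    then average the stratum-specific effects weighted by stratum size
strata_effects <- sapply(0:1, function(c_val) {
  idx <- C == c_val
  mean(Y[idx & X == 1]) - mean(Y[idx & X == 0])
})
stratified <- sum(strata_effects * c(mean(C == 0), mean(C == 1)))

# 2. Regression adjustment: include C as a covariate
adjusted <- unname(coef(lm(Y ~ X + C))["X"])

# 3. Propensity score: inverse probability weighting with P(X = 1 | C)
ps <- fitted(glm(X ~ C, family = binomial))
w <- ifelse(X == 1, 1 / ps, 1 / (1 - ps))
ipw <- weighted.mean(Y[X == 1], w[X == 1]) - weighted.mean(Y[X == 0], w[X == 0])

print(round(c(naive = naive, stratified = stratified,
              adjusted = adjusted, ipw = ipw), 2))
```

With these settings the naive estimate is expected to be near 2.2, while the three adjusted estimates should all land close to the true effect of 1.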

The backdoor criterion in causal inference provides a graphical rule for identifying which variables need to be controlled to eliminate confounding bias when estimating causal effects.
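
When a set of variables satisfies the backdoor criterion (assumed here to be the single discrete confounder $C$), the interventional distribution can be expressed in terms of observable quantities through the standard backdoor adjustment formula:
$$
P(Y \mid \text{do}(X = x)) = \sum_{c} P(Y \mid X = x, C = c) \, P(C = c)
$$
By contrast, the observed conditional distribution weights the same strata by $P(C = c \mid X = x)$:
$$
P(Y \mid X = x) = \sum_{c} P(Y \mid X = x, C = c) \, P(C = c \mid X = x)
$$
The two expressions differ only in how the strata of $C$ are weighted. They agree, for example, when $C$ is independent of $X$.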

# Example

This example demonstrates how to identify and control for confounding in genetic association studies. We create a simple dataset with:

- Genetic variants for 5 individuals
- Height measurements
- Ancestry information (the confounder)

We perform two analyses:

- A naive analysis testing associations between genetic variants and height
- An adjusted analysis controlling for ancestry as a confounder

The example illustrates how ancestry can distort associations between genetic variants and height, and how adjusting for it blocks the "backdoor path" created by the confounder. With only 5 individuals, the estimates are illustrative rather than statistically reliable.

In [35]:
# Clear the environment
rm(list = ls())

# Define genotypes for 5 individuals at 3 variants
# These represent actual alleles at each position
# For example, Individual 1 has genotypes: CC, CT, AT
genotypes <- c(
  "CC", "CT", "AT", # Individual 1
  "TT", "TT", "AA", # Individual 2
  "CT", "CT", "AA", # Individual 3
  "CC", "TT", "AA", # Individual 4
  "CC", "CC", "TT" # Individual 5
)
# Reshape into a matrix
N <- 5 # number of individuals
M <- 3 # number of variants
geno_matrix <- matrix(genotypes, nrow = N, ncol = M, byrow = TRUE)
rownames(geno_matrix) <- paste("Individual", 1:N)
colnames(geno_matrix) <- paste("Variant", 1:M)

alt_alleles <- c("T", "C", "T")
ref_alleles <- c("C", "T", "A")

# Convert to a numeric genotype matrix using the additive model
Xraw_additive <- matrix(0, nrow = N, ncol = M) # count number of non-reference alleles

rownames(Xraw_additive) <- rownames(geno_matrix)
colnames(Xraw_additive) <- colnames(geno_matrix)

for (i in 1:N) {
  for (j in 1:M) {
    alleles <- strsplit(geno_matrix[i, j], "")[[1]]
    Xraw_additive[i, j] <- sum(alleles == alt_alleles[j])
  }
}
X <- scale(Xraw_additive, center = TRUE, scale = TRUE)

In [36]:
# assign observed height for the 5 individuals
Y_raw <- c(180, 160, 158, 155, 193)
Y <- scale(Y_raw)

In [37]:
# Now let's introduce a confounder: Ancestry
# We'll create two ancestry groups that correlate with both genotype and phenotype
ancestry <- c(1, 2, 2, 1, 1) # 1=Population A, 2=Population B
ancestry_factor <- as.factor(ancestry)

# Reuse the observed heights from the previous cell (the values are fixed, not simulated)
Y_raw <- c(180, 160, 158, 155, 193)
Y <- scale(Y_raw)
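
Before running the association tests, it can help to check whether ancestry is related to both the genotypes and the phenotype in this toy dataset. The sketch below is self-contained: it re-enters the alternative-allele counts, heights, and ancestry labels from the cells above under new, hypothetical names (`alt_counts`, `height`, `pop`) so that it does not overwrite the variables used elsewhere in the example.

```r
# Toy data copied from the example: alternative-allele counts, height, ancestry
alt_counts <- matrix(
  c(0, 1, 1,
    2, 0, 0,
    1, 1, 0,
    0, 0, 0,
    0, 2, 2),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste("Individual", 1:5), paste("Variant", 1:3))
)
height <- c(180, 160, 158, 155, 193)
pop <- factor(c(1, 2, 2, 1, 1), labels = c("Population A", "Population B"))

# Confounder -> outcome: mean height per ancestry group
height_by_pop <- tapply(height, pop, mean)

# Confounder -> exposure: mean alternative-allele count per ancestry group
geno_by_pop <- apply(alt_counts, 2, function(g) tapply(g, pop, mean))

print(height_by_pop)
print(geno_by_pop)
```

In this dataset the mean height is 176 in Population A and 159 in Population B, and the alternative allele of Variant 1 is absent in Population A but averages 1.5 copies in Population B. Ancestry is therefore associated with both the exposure and the outcome, which is the pattern expected of a confounder.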

In [38]:
# First, run OLS as in the previous section on summary statistics
# Calculate the alternative (effect) allele frequency of each variant
# (for these three variants it is also the minor allele frequency)
MAF <- colMeans(Xraw_additive) / 2

# Perform GWAS-style analysis: Test each SNP independently using OLS
sumstats <- data.frame(
    SNP = paste0("rs", 1:M),
    CHR = c(1, 1, 2), # Example chromosome assignments
    BP = c(1000, 2000, 5000), # Example base pair positions
    ALT = alt_alleles, # Effect allele
    REF = ref_alleles, # Reference allele
    N = rep(N, M), # Sample size
    BETA = numeric(M), # Effect size
    SE = numeric(M), # Standard error
    Z = numeric(M), # Z-score
    P = numeric(M), # P-value
    EAF = MAF # Effect allele frequency
)

for (j in 1:M) {
    SNP <- X[, j] # Extract genotype for SNP j
    model <- lm(Y ~ SNP) # OLS regression: Trait ~ SNP
    summary_model <- summary(model)

    # Store results in standard format
    sumstats$BETA[j] <- summary_model$coefficients[2, 1] # Effect size
    sumstats$SE[j] <- summary_model$coefficients[2, 2] # Standard error
    sumstats$Z[j] <- summary_model$coefficients[2, 3] # t-statistic (approximates a Z-score only in large samples)
    sumstats$P[j] <- summary_model$coefficients[2, 4] # P-value
}


In [39]:
# Now perform the analysis adjusting for the confounder (ancestry)

# Perform GWAS-style analysis: Test each SNP using OLS with ancestry as a covariate
sumstats_adjusted <- data.frame(
  SNP = paste0("rs", 1:M),
  CHR = c(1, 1, 2),  # Example chromosome assignments
  BP = c(1000, 2000, 5000),  # Example base pair positions
  ALT = alt_alleles,  # Effect allele
  REF = ref_alleles,  # Reference allele
  N = rep(N, M),  # Sample size
  BETA = numeric(M),  # Effect size
  SE = numeric(M),  # Standard error
  Z = numeric(M),  # Z-score
  P = numeric(M),  # P-value
  EAF = MAF  # Effect allele frequency
)

for (j in 1:M) {
  SNP <- X[, j]  # Extract genotype for SNP j
  model_adjusted <- lm(Y ~ SNP + ancestry_factor)  # OLS regression: Trait ~ SNP + ancestry
  summary_model_adjusted <- summary(model_adjusted)
  
  # Store results in standard format
  sumstats_adjusted$BETA[j] <- summary_model_adjusted$coefficients[2, 1] # Effect size
  sumstats_adjusted$SE[j] <- summary_model_adjusted$coefficients[2, 2] # Standard error
  sumstats_adjusted$Z[j] <- summary_model_adjusted$coefficients[2, 3] # t-statistic (approximates a Z-score only in large samples)
  sumstats_adjusted$P[j] <- summary_model_adjusted$coefficients[2, 4] # P-value
}


In [40]:
# Print summary statistics in standard format (before adjusting for ancestry)
print("GWAS Summary Statistics:")
sumstats
# Print summary statistics in standard format (after adjusting for ancestry)
print("GWAS Summary Statistics after considering genetic ancestry:")
sumstats_adjusted

[1] "GWAS Summary Statistics:"


| | SNP | CHR | BP | ALT | REF | N | BETA | SE | Z | P | EAF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Variant 1 | rs1 | 1 | 1000 | T | C | 5 | -0.5000913 | 0.49996955 | -1.000244 | 0.390901513 | 0.3 |
| Variant 2 | rs2 | 1 | 2000 | C | T | 5 | 0.8525024 | 0.30179448 | 2.824778 | 0.066475513 | 0.4 |
| Variant 3 | rs3 | 2 | 5000 | T | A | 5 | 0.9866667 | 0.09396605 | 10.500246 | 0.001844466 | 0.3 |


[1] "GWAS Summary Statistics after considering genetic ancestry:"


| | SNP | CHR | BP | ALT | REF | N | BETA | SE | Z | P | EAF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Variant 1 | rs1 | 1 | 1000 | T | C | 5 | 0.1081279 | 1.4766481 | 0.0732252 | 0.94829123 | 0.3 |
| Variant 2 | rs2 | 1 | 2000 | C | T | 5 | 0.7484682 | 0.3201662 | 2.3377489 | 0.14438001 | 0.4 |
| Variant 3 | rs3 | 2 | 5000 | T | A | 5 | 1.0272146 | 0.1378365 | 7.4524131 | 0.01753339 | 0.3 |
